Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability
Patent
1998-03-03
2000-03-28
Beausoliel, Jr., Robert W.
G06F 11/14
active
060444754
DESCRIPTION:
BRIEF SUMMARY
TECHNICAL FIELD
The present invention relates to a system for checkpointing and restoring the state of a process, and more particularly, to systems for checkpointing and restoring the process state, including lazy checkpoints of the persistent process state, or any specified portion thereof.
BACKGROUND ART
Increasingly, users of software applications are demanding that the software be resistant, or at least tolerant, to software faults. Users of telecommunication switching systems, for example, demand that the switching systems be continuously available. In addition, where transmissions involve financial transactions, such as for bank automated teller machines, or other sensitive data, customers also demand the highest degree of data consistency.
Thus, a number of software testing and debugging tools have been developed for detecting many programming errors which may cause a fault in a user application process. For example, the Purify™ software testing tool, commercially available from Pure Software, Inc., of Sunnyvale, Calif., and described in U.S. Pat. No. 5,193,180, provides a system for detecting memory access errors and memory leaks. The Purify™ system monitors the allocation and initialization status for each byte of memory. In addition, for each software instruction that accesses memory, the Purify™ system performs a test to ensure that the program is not writing to unallocated memory, and is not reading from uninitialized or unallocated memory.
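The per-byte status tracking described above can be illustrated with a toy model. The following is only a sketch under stated assumptions, not the Purify™ implementation: the names `ByteState`, `ShadowMemory`, `allocate`, `check_write` and `check_read` are hypothetical, and memory is modeled as a simple Python list of status values rather than as instrumented machine code.

```python
from enum import Enum


class ByteState(Enum):
    """Hypothetical per-byte status values for the toy model."""
    UNALLOCATED = 0
    ALLOCATED_UNINITIALIZED = 1
    INITIALIZED = 2


class ShadowMemory:
    """Tracks allocation and initialization status for each byte (toy model)."""

    def __init__(self, size):
        # Every byte starts out unallocated.
        self.status = [ByteState.UNALLOCATED] * size

    def allocate(self, start, length):
        # Newly allocated bytes are allocated but not yet initialized.
        for addr in range(start, start + length):
            self.status[addr] = ByteState.ALLOCATED_UNINITIALIZED

    def check_write(self, addr):
        """Return an error string for an illegal write, else None."""
        if self.status[addr] is ByteState.UNALLOCATED:
            return "write to unallocated memory"
        # A legal write initializes the byte.
        self.status[addr] = ByteState.INITIALIZED
        return None

    def check_read(self, addr):
        """Return an error string for an illegal read, else None."""
        state = self.status[addr]
        if state is ByteState.UNALLOCATED:
            return "read from unallocated memory"
        if state is ByteState.ALLOCATED_UNINITIALIZED:
            return "read from uninitialized memory"
        return None
```

In this model, each simulated memory access is checked against the status table before it proceeds, which mirrors the test-before-access idea in the paragraph above at a much coarser level.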
While software testing and debugging tools, such as the Purify™ system, provide an effective basis for detecting many programming errors which may lead to a fault in the user application process, no amount of verification, validation or testing during the software debugging process will detect and eliminate all software faults and give complete confidence in a user application program. Accordingly, residual faults due to untested boundary conditions, unanticipated exceptions and unexpected execution environments have been observed to escape the testing and debugging process and, when triggered during program execution, will manifest themselves and cause the application process to crash or hang, thereby causing service interruption.
It is therefore desirable to provide mechanisms that allow a user application process to recover from a fault with a minimal amount of lost information. To that end, a number of checkpointing and restoration techniques have been proposed to recover more efficiently from hardware and software failures. For a general discussion of checkpointing and rollback recovery techniques, see R. Koo and S. Toueg, "Checkpointing and Rollback-Recovery for Distributed Systems," IEEE Trans. Software Eng., Vol. SE-13, No. 1, pp. 23-31 (January 1987). Generally, checkpoint and restoration techniques periodically save the process state during normal execution, and thereafter restore the saved state following a failure. In this manner, the amount of lost work is limited to the progress made by the user application process since the restored checkpoint was taken.
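The periodic save-and-restore cycle described above can be sketched as follows. This is a minimal illustration, assuming the process state fits in an in-memory dictionary and that a failure is simulated at a chosen step; `CheckpointedCounter`, `run`, `interval` and `fail_at` are hypothetical names introduced here, not terms from the patent.

```python
import copy


class CheckpointedCounter:
    """Toy 'process' whose state is a small dictionary held in memory."""

    def __init__(self):
        self.state = {"step": 0, "total": 0}
        # Take an initial checkpoint so a restore is always possible.
        self._saved = copy.deepcopy(self.state)

    def checkpoint(self):
        # Save an independent copy of the current state.
        self._saved = copy.deepcopy(self.state)

    def restore(self):
        # Roll back to the most recently saved state.
        self.state = copy.deepcopy(self._saved)

    def work(self, value):
        self.state["step"] += 1
        self.state["total"] += value


def run(values, interval, fail_at):
    """Process values, checkpointing every `interval` steps.

    A failure is simulated just before step `fail_at` (1-based); the state
    is then restored from the last checkpoint and returned. Passing a
    `fail_at` that never matches (for example 0) runs to completion.
    """
    proc = CheckpointedCounter()
    for i, value in enumerate(values, start=1):
        if i == fail_at:
            proc.restore()
            return proc.state
        proc.work(value)
        if i % interval == 0:
            proc.checkpoint()
    return proc.state
```

With `run([1, 2, 3, 4, 5], interval=2, fail_at=4)`, the checkpoint taken after step 2 is restored, so only the work of step 3 is lost. A shorter interval reduces lost work at the price of more frequent saves.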
It is noted that the state of a process includes the volatile state as well as the persistent state. The volatile state includes any process information that would normally be lost upon a failure. The persistent state includes all user files that are related to the current execution of the user application process. Although the persistent state is generally not lost upon a failure, it is necessary to restore the persistent state to the same point as the restored volatile state, in order to maintain data consistency.
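The need to roll the persistent state back to the same point as the volatile state can be shown with a small sketch. It assumes, purely for illustration, that the only user file is an append-only log, so recording its size at checkpoint time is enough to roll it back by truncation. The function `demo` and the `records_written` counter are hypothetical, and this is not the lazy checkpointing mechanism of the patent.

```python
import os
import tempfile


def demo():
    """Checkpoint and restore volatile state together with one user file."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        # Volatile state: lost on a failure unless it has been checkpointed.
        volatile = {"records_written": 0}

        def write_record(text):
            # Persistent state: survives a failure, but may run ahead of
            # the volatile state that is later restored.
            with open(path, "a", encoding="utf-8") as f:
                f.write(text + "\n")
            volatile["records_written"] += 1

        write_record("a")
        write_record("b")

        # Checkpoint: save the volatile state and the matching file size.
        ckpt = {"volatile": dict(volatile), "file_size": os.path.getsize(path)}

        # Work done after the checkpoint, then a simulated failure.
        write_record("c")

        # Restore both parts to the same point in the execution.
        volatile = dict(ckpt["volatile"])
        with open(path, "r+b") as f:
            f.truncate(ckpt["file_size"])

        with open(path, encoding="utf-8") as f:
            lines = f.read().splitlines()
        return volatile, lines
    finally:
        os.remove(path)
```

If the file were not truncated, it would still contain the third record while the restored counter reported two, which is exactly the inconsistency the paragraph above warns about.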
While existing checkpointing and recovery techniques have adequately addressed checkpointing of the volatile state, these techniques have failed to adequately address checkpointing of the persistent state. According to one approach, all of the persistent state, in other words, all of the user files, is checkpointed with each checkpoint of the volatile state. Clearly, the overhead associated with this technique is prohibitively
REFERENCES:
patent: 4697266 (1987-09-01), Finley
patent: 4814971 (1989-03-01), Thatte
patent: 4819156 (1989-04-01), DeLorme et al.
patent: 4868744 (1989-09-01), Reinsch et al.
patent: 5201044 (1993-04-01), Frey, Jr. et al.
patent: 5235700 (1993-08-01), Alaiwan et al.
patent: 5333303 (1994-07-01), Mohan
patent: 5369757 (1994-11-01), Spiro et al.
patent: 5440726 (1995-08-01), Fuchs et al.
patent: 5530802 (1996-06-01), Fuchs et al.
patent: 5590277 (1996-12-01), Fuchs et al.
Saleh, Kassem et al. "Efficient and Fault-Tolerant Checkpointing Procedures for Distributed Systems," Computers and Communications, 1993 International Phoenix Conference.
Chung Pi-Yu
Huang Yennun
Kintala Chandra
Vo Kiem-Phong
Wang Yi-Min
Baderman Scott T.
Beausoliel, Jr. Robert W.
Lucent Technologies, Inc.
Checkpoint and restoration systems for execution control