Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability
Reexamination Certificate
1998-07-03
2001-05-08
Niebling, John F. (Department: 2812)
C714S013000
active
06230282
ABSTRACT:
FIELD OF THE INVENTION
The present invention relates to fault tolerant computers, and more particularly, to computer systems that utilize a checkpointing error recovery system to recover from a system failure.
BACKGROUND OF THE INVENTION
One type of fault tolerant computer system utilizes a fault detection system that depends on the state of the computer being periodically recorded. In one version of this type of system, the state of the computer is recorded in a second “slave” computer. If an error is detected between checkpoints, the slave computer takes over from the state recorded at the last checkpoint. When a cache line is written into the memory of the “master” computer, the same cache line is copied into a buffer in the slave computer system. At each checkpoint, the contents of the buffer are written into the memory of the slave computer, thereby bringing the master and slave memories into synchronization at the checkpoint. If a failure occurs, the slave computer's memory already holds the state that the master computer's memory had at the last checkpoint. Hence, the slave computer can take over the computation starting from that point.
The buffer is typically first-in-first-out (FIFO). The FIFO must be large enough to store all of the writes that occur between checkpoints. If a buffer overflow occurs, the state of the two systems will not be synchronized at the next checkpoint, and the error recovery system will fail. Accordingly, a large FIFO must be utilized. Such a buffer increases the cost of the system.
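By way of illustration only, the prior-art arrangement described above can be sketched in Python as follows. The class name, method names, and the fixed `capacity` parameter are hypothetical and do not appear in the original text; the slave memory is modeled as a dictionary, and the transfer at a checkpoint is treated as instantaneous, which a real system would not achieve.

```python
from collections import deque


class BufferOverflow(Exception):
    """Raised when the FIFO cannot hold another write before the checkpoint."""


class ForwardLogSlave:
    """Hypothetical model of the prior-art slave: writes wait in a FIFO."""

    def __init__(self, capacity):
        self.memory = {}          # slave memory, synchronized only at checkpoints
        self.fifo = deque()       # copies of master writes since the last checkpoint
        self.capacity = capacity  # assumed fixed FIFO size

    def on_master_write(self, address, data):
        # Each write to the master memory is copied into the slave's buffer.
        if len(self.fifo) >= self.capacity:
            raise BufferOverflow("FIFO full; master and slave can no longer be synchronized")
        self.fifo.append((address, data))

    def checkpoint(self):
        # Apply the buffered writes in arrival order (first in, first out).
        while self.fifo:
            address, data = self.fifo.popleft()
            self.memory[address] = data

    def take_over(self):
        # On a failure, discard uncommitted writes; memory holds the last checkpoint.
        self.fifo.clear()
        return self.memory
```

In this sketch an overflow is fatal to the recovery scheme, which is why the FIFO in such a system must be sized for the largest number of writes expected between checkpoints.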
Unfortunately, no FIFO size can guarantee that an overflow will not occur. Consider a case in which the FIFO gradually accumulates data during a checkpoint period. The transfer of the data to the slave memory for this checkpoint period does not start until the checkpoint period is completed. At this point the slave begins to read entries from the FIFO and write those entries into the slave's memory. In the meantime, checkpoint data for the next period is arriving at the FIFO for storage. The FIFO now holds partial checkpoint data for both the previous period and the current period. If the inflow rate is particularly high, the FIFO can have more than two intervals' worth of data stored in it. The ultimate limit on the rate at which the FIFO can be emptied is determined by the speed at which the slave computer can read the FIFO and then write its main memory. If the applications are generating a series of writes with no intervening memory cycles, the data will accumulate in the FIFO. The extent of the accumulation depends on the density of writes; hence, no FIFO size can assure that a failure will not occur. Such a failure would require stopping both machines and copying the master memory in its entirety into the slave memory. Since the memories in question may be quite large, it is advantageous to avoid such system failures.
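The accumulation argument above can be made concrete with a small calculation. The following Python sketch is an illustrative model only: the function name, the notion of a fixed number of entries drained per interval, and the sample figures are assumptions introduced here, not values from the original text. Entries written during an interval become eligible for transfer only once that interval ends, while the slave drains older entries at a bounded rate.

```python
def peak_fifo_occupancy(writes_per_interval, drain_per_interval):
    """Return the largest FIFO occupancy reached at any interval boundary.

    writes_per_interval: number of master writes arriving in each interval.
    drain_per_interval: entries the slave can move to its memory per interval
                        (an assumed constant rate).
    """
    backlog = 0  # entries from completed intervals still waiting in the FIFO
    peak = 0
    for arriving in writes_per_interval:
        # During the interval the slave drains old entries while new ones arrive.
        remaining_old = max(backlog - drain_per_interval, 0)
        occupancy_at_end = remaining_old + arriving
        peak = max(peak, occupancy_at_end)
        # At the checkpoint, the newly arrived entries join the backlog.
        backlog = occupancy_at_end
    return peak


# The slave keeps up: occupancy never exceeds one interval's worth of writes.
print(peak_fifo_occupancy([100] * 10, 100))  # 100
# The slave falls behind: occupancy grows by 40 entries every interval.
print(peak_fifo_occupancy([100] * 10, 60))   # 460
```

Under these assumptions the occupancy grows without bound whenever the sustained write rate exceeds the drain rate, so any fixed FIFO size is eventually exceeded if the run is long enough.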
Broadly, it is the object of the present invention to provide an improved checkpoint memory system.
It is a further object of the present invention to provide a checkpoint memory system that requires less FIFO buffer space than prior art systems.
It is a still further object of the present invention to provide a checkpoint memory system that does not fail if a buffer overflow occurs.
These and other objects of the present invention will become apparent to those skilled in the art from the following detailed description of the invention and the accompanying drawings.
SUMMARY OF THE INVENTION
The present invention is a computer system having a checkpoint error recovery system. The computer system includes a first computer having a first memory and a second computer having a second memory and a buffer. The first and second memories are updated by memory updates, each of which includes an address specifying a location and data to be written at that location in the memory receiving the update. The computer system also includes an interface for providing the second computer with a copy of each memory update received by the first memory. Upon receiving each of the copies of the memory updates, the second computer generates a recovery memory update corresponding to that copy of the memory update. The recovery memory update includes the address specified in the received copy and the data stored in the second memory at that address. The second computer then updates the second memory using the copy of the memory update, and writes the recovery memory update into the buffer if the buffer does not already contain a recovery memory update for that address. The second computer empties the buffer upon the receipt of a checkpoint interval signal. The second computer updates the second memory with the recovery memory updates stored in the buffer in response to the receipt of an error signal. The recovery memory updates are performed in the order in which they were stored in the buffer.
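The behavior of the second computer summarized above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the class and method names are hypothetical, the second memory is modeled as a fixed-size list initialized to zero, a set is used as an assumed means of testing whether the buffer already holds an entry for an address, and the buffer is assumed to be emptied after recovery.

```python
class UndoLogSlave:
    """Hypothetical model of the second computer with a recovery buffer."""

    def __init__(self, size):
        self.memory = [0] * size  # second memory, updated on every copied write
        self.buffer = []          # recovery updates (address, old data), in storage order
        self.logged = set()       # addresses that already have a recovery update

    def on_master_write(self, address, data):
        # Build the recovery update from the data currently at this address.
        recovery_update = (address, self.memory[address])
        # Apply the copied update to the second memory immediately.
        self.memory[address] = data
        # Store the recovery update only for the first write to this address.
        if address not in self.logged:
            self.buffer.append(recovery_update)
            self.logged.add(address)

    def on_checkpoint(self):
        # The current memory contents become the new recovery point.
        self.buffer.clear()
        self.logged.clear()

    def on_error(self):
        # Restore the checkpoint state, in the order the updates were stored.
        for address, old_data in self.buffer:
            self.memory[address] = old_data
        self.on_checkpoint()  # assumption: the buffer is emptied after recovery


slave = UndoLogSlave(size=4)
slave.on_master_write(1, 11)
slave.on_checkpoint()
slave.on_master_write(1, 22)
slave.on_master_write(1, 33)  # second write to address 1 adds no buffer entry
slave.on_master_write(2, 44)
slave.on_error()
print(slave.memory)  # [0, 11, 0, 0]
```

Because only the first write to each address between checkpoints adds an entry, the buffer in this sketch never holds more entries than the number of distinct addresses written in the interval, regardless of how many writes arrive.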
REFERENCES:
patent: 5381545 (1995-01-01), Baker et al.
patent: 5745672 (1998-04-01), Stiffler
patent: 5913021 (1999-06-01), Masubuchi
patent: 5958070 (1999-09-01), Stiffler
patent: 6079030 (2000-06-01), Masubuchi
Hewlett-Packard Company
Niebling John F.
Whitmore Stacy