Time-lag duplexing techniques
Reexamination Certificate
1998-06-26
2001-03-06
Hua, Ly V. (Department: 2785)
Error detection/correction and fault detection/recovery
Data processing system error or fault handling
Reliability and availability
C714S012000, C714S021000
active
06199171
ABSTRACT:
RELATED APPLICATIONS
The present application is related to co-pending application Ser. No. 08/929,014, entitled “METHOD AND SYSTEM FOR FAULT-HANDLING TO IMPROVE RELIABILITY OF A DATA-PROCESSING SYSTEM”, filed on Sep. 15, 1997, assigned to the assignee of the present application and included herein by reference.
1. Field of the Invention
The present invention relates generally to information processing systems and more particularly to a methodology and system for handling detected faults in a processor.
2. Background of the Invention
As personal computers and workstations are utilized to perform more and more substantial applications that were formerly reserved for mainframes, system availability and data integrity become increasingly important. In the prior art, a technique known as lock-step duplexing has been utilized to assure data integrity in lower priced computers. With lock-step duplexing, two processing elements are utilized for fault detection, and when a mismatch is found between the two processing elements, the computer system immediately comes to a halt. In certain respects, this is a very safe methodology, as it assumes that all errors that occur are permanent. At the same time, the associated cost of this methodology can be very high because there is usually a long downtime for each outage. This is particularly true when the majority of errors that occur in the field are transient in nature, which makes such a methodology overly conservative.
As an improvement, some lock-step duplexing systems are enhanced by utilizing a “retry.” More specifically, if there is a mismatch, both processing elements are retried and the result comparison is performed again. The computer system is halted when there is a second mismatch. Accordingly, the technique of lock-step duplexing with retry can also be utilized for detecting and recovering from transient errors. Due to the high occurrence rate of transient errors, lock-step duplexing systems with retry tend to have higher system availability than lock-step duplexing systems without retry. Still, there is a concern about data integrity exposures in all systems that are based on the lock-step duplexing technique. This concern stems from common-mode errors.
Common-mode errors (either permanent or transient), which may occur in any peripheral component of the computer system, such as memory, bus, etc., can potentially feed both lock-stepped processing elements with the same bad data and cause a data integrity violation without being detected.
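The prior-art behavior described above can be illustrated with a minimal Python sketch. It is not taken from the patent: the names run_lockstep, make_flaky, square and LockStepHalt are hypothetical, and the processing elements are modeled as plain functions. It shows a halt on the second mismatch, recovery from a transient error by retry, and a common-mode error passing undetected because both elements receive the same bad input.

```python
class LockStepHalt(Exception):
    """Raised when the duplexed pair must be halted (hypothetical name)."""


def run_lockstep(element_a, element_b, data, max_retries=1):
    """Run two processing elements on the same input and compare results.

    With max_retries=0 this models plain lock-step duplexing (halt on the
    first mismatch). With max_retries=1 it models lock-step duplexing with
    retry (halt on the second mismatch).
    """
    attempts = 0
    while True:
        result_a = element_a(data)
        result_b = element_b(data)
        if result_a == result_b:
            return result_a
        if attempts >= max_retries:
            raise LockStepHalt("mismatch persisted after retry")
        attempts += 1


def make_flaky(fn, bad_calls):
    """Wrap fn so that its first `bad_calls` invocations return a wrong result."""
    remaining = [bad_calls]

    def wrapped(x):
        if remaining[0] > 0:
            remaining[0] -= 1
            return fn(x) + 1  # simulated error in one processing element
        return fn(x)

    return wrapped


def square(x):
    return x * x


# Transient error: one bad result, then the retry matches.
recovered = run_lockstep(square, make_flaky(square, 1), 7)

# Common-mode error: the input itself was corrupted (8 instead of 7) before
# reaching either element, so both agree on a wrong answer and nothing is flagged.
undetected = run_lockstep(square, square, 8)
```

In this sketch the comparison can only catch disagreements between the two elements; it has no way to notice that the shared input was already wrong, which is the data integrity exposure noted above.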
Moreover, prior error detection and recovery methodologies implemented for systems in which transactions can be loaded directly from I/O devices in non-batch mode operations are not necessarily applicable to batch mode operations.
Accordingly, there is a need for an improved and yet reasonably economical method and system for the detection, reporting, and recovery of transient errors in computer systems.
SUMMARY OF THE INVENTION
A method and apparatus are provided which enable processor error detection and handling in both batch and non-batch mode computer systems. An exemplary embodiment includes a first processor, a second processor, an I/O processor and a comparator. The leading processor uses a write check buffer in the I/O processor to temporarily store write requests. The lagging processor performs only pseudo write operations by writing to its own private write buffer. After a predetermined interval, the write requests for both the leading and lagging processors are committed by flushing to disk. At flush time, the entries of the lagging processor's write buffer are compared with the entries of the I/O processor's public write check buffer. If a mismatch between the buffer entries is indicated, the respective transactions are marked as corrupted and are scheduled for re-execution.
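The flush-time comparison summarized above might be sketched as follows in Python. This is an illustrative model only, not the patented implementation: the names WriteRequest, TimeLagPair, leading_write, lagging_write and flush are hypothetical, the disk is modeled as a dictionary, and entries are grouped by a transaction identifier. The sketch also assumes that writes belonging to a mismatched transaction are withheld from disk until re-execution, a detail the summary does not spell out.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class WriteRequest:
    txn_id: int   # transaction that issued the write (assumed field)
    address: int
    value: int


@dataclass
class TimeLagPair:
    # Write check buffer in the I/O processor, filled by the leading processor.
    check_buffer: list = field(default_factory=list)
    # Private write buffer of the lagging processor (pseudo writes only).
    private_buffer: list = field(default_factory=list)
    # Stand-in for the disk: address -> committed value.
    disk: dict = field(default_factory=dict)

    def leading_write(self, req):
        self.check_buffer.append(req)

    def lagging_write(self, req):
        self.private_buffer.append(req)

    @staticmethod
    def _group(buffer):
        """Group buffered writes by transaction, preserving write order."""
        grouped = {}
        for req in buffer:
            grouped.setdefault(req.txn_id, []).append((req.address, req.value))
        return grouped

    def flush(self):
        """Compare both buffers, commit matching transactions, and return
        the ids of transactions marked as corrupted (to be re-executed)."""
        leading = self._group(self.check_buffer)
        lagging = self._group(self.private_buffer)
        corrupted = []
        for txn_id in sorted(leading.keys() | lagging.keys()):
            if leading.get(txn_id) == lagging.get(txn_id):
                for address, value in leading[txn_id]:
                    self.disk[address] = value
            else:
                corrupted.append(txn_id)
        self.check_buffer.clear()
        self.private_buffer.clear()
        return corrupted


pair = TimeLagPair()
pair.leading_write(WriteRequest(1, 0x10, 5))
pair.lagging_write(WriteRequest(1, 0x10, 5))
pair.leading_write(WriteRequest(2, 0x20, 9))
pair.lagging_write(WriteRequest(2, 0x20, 8))  # lagging processor disagrees
to_reexecute = pair.flush()
```

Here transaction 1 is committed because both processors produced the same write, while transaction 2 is returned for re-execution. A scheduler that re-runs the returned transactions is left out of the sketch.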
REFERENCES:
patent: 5446872 (1995-08-01), Ayres et al.
patent: 5491792 (1996-02-01), Grisham et al.
patent: 5608866 (1997-03-01), Horikawa
patent: 6058491 (2000-05-01), Bossen et al.
Bossen Douglas Craig
Chandra Arun
DeFrank Edmond A.
Emile Volel
Hua Ly V.
International Business Machines Corporation