Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability
Reexamination Certificate
2001-03-29
2004-12-07
Baderman, Scott (Department: 2114)
C714S040000, C714S043000, C714S044000
active
06829729
ABSTRACT:
FIELD OF THE INVENTION
The present invention relates generally to processing systems and more particularly to a fault isolation methodology related to such systems.
BACKGROUND OF THE INVENTION
Conventional computing systems crash when they encounter uncorrectable/unrecoverable data errors (UEs). For the owner of the system, the impact can range from a minor nuisance to severe monetary business losses, so a system owner is adversely affected by, and dissatisfied with, such crashes. Methods to avoid these crashes therefore have both tangible and intangible benefits.
On a conventional multiprocessing computing system platform which includes a service processor, an error classification and processing model is provided whereby the hardware within the central electronic complex notifies a service processor (SP) of conditions requiring processing. An attention signal is provided that informs the SP that such a condition has occurred. The hardware has functions that capture and inform the SP of which type of condition has occurred. In the conventional system there are three (3) possible hardware detected error types:
1. Recovered Error Attention (REA): A hardware detected error condition which the hardware itself recovered from.
2. Special Attention (SA): A hardware detected condition (not necessarily an error) that requires specific unique SP processing actions.
3. Checkstop Attention (CSA): A hardware detected error condition for which hardware caused the system to cease operating (i.e., system crashes).
In this model, a given fault or attention condition was designed to be detected and reported from one and only one logical fault source point. A UE in this model was reported as a CSA, thereby causing the system hardware to crash immediately. Accordingly, it is desirable to find ways to keep systems functioning as well as possible when UE conditions are encountered. It is also desirable to provide correct fault isolation in a computer system that continues to function while it passes the “data with error” through multiple system components on the way to the data's destination, with various repercussions at each observation point. The present invention addresses such a need.
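As an illustration only, the conventional classification described above can be sketched in Python. The enumeration members follow the three attention types named in the text; the condition names, the mapping, and the function names are hypothetical and are not taken from any actual service processor code.

```python
from enum import Enum


class Attention(Enum):
    """The three hardware-detected attention types of the conventional model."""
    REA = "Recovered Error Attention"
    SA = "Special Attention"
    CSA = "Checkstop Attention"


# Hypothetical condition names, chosen for illustration only.
# In the conventional model a UE is reported as a CSA.
CONVENTIONAL_REPORTING = {
    "recovered_error": Attention.REA,
    "special_condition": Attention.SA,
    "uncorrectable_error": Attention.CSA,
}


def classify_conventional(condition: str) -> Attention:
    """Return the attention type the conventional model reports for a condition."""
    return CONVENTIONAL_REPORTING[condition]


def system_keeps_running(attention: Attention) -> bool:
    """Only a checkstop makes the hardware cease operating."""
    return attention is not Attention.CSA
```

Under these assumptions, `system_keeps_running(classify_conventional("uncorrectable_error"))` evaluates to `False`, which is the crash-on-UE behavior the background section identifies as the problem.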
SUMMARY OF THE INVENTION
A method and system for managing uncorrectable data error conditions from an I/O subsystem as the UE passes through a plurality of devices in a central electronic complex (CEC) is disclosed. The method and system comprise detecting an I/O UE by at least one device in the CEC, and providing an SUE-RE (Special Uncorrectable Data Error-Recoverable Error) attention signal by at least one device to a diagnostic system that indicates the I/O UE condition. The method and system further include analyzing the SUE-RE attention signal by the diagnostic system to produce an error log with a list of failing parts and a record of the log.
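The flow summarized above (devices detect the I/O UE, raise SUE-RE attention signals, and a diagnostic system turns those signals into an error log with a list of failing parts) might be sketched as follows. This is a minimal sketch, not the patented algorithm: the class names, the device names, and the `path_position` field are hypothetical, and the ordering policy (listing the observer closest to the I/O subsystem first) is an assumption made purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SueReAttention:
    """An SUE-RE attention signal raised by one device in the CEC."""
    device: str          # hypothetical device name
    path_position: int   # assumed: 0 = closest to the I/O subsystem


@dataclass
class ErrorLog:
    condition: str
    failing_parts: List[str]


def analyze(signals: List[SueReAttention]) -> Optional[ErrorLog]:
    """Build an error log from SUE-RE signals (illustrative policy only)."""
    if not signals:
        return None  # no device reported an I/O UE
    # Assumed policy: order observers along the data path, nearest the
    # I/O subsystem first, and list each device at most once.
    ordered = sorted(signals, key=lambda s: s.path_position)
    parts: List[str] = []
    for signal in ordered:
        if signal.device not in parts:
            parts.append(signal.device)
    return ErrorLog(condition="I/O UE", failing_parts=parts)
```

With signals from a hypothetical `memory_controller` at position 1 and an `io_hub` at position 0, `analyze` returns a log whose `failing_parts` is `["io_hub", "memory_controller"]`. A real diagnostic system would apply its own isolation rules, which the summary does not spell out.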
A method and system in accordance with the present invention provide a new fault isolation methodology and algorithm, which extends the current capability of a service processor runtime diagnostic code (PRD). This new methodology allows for an accurate, more focused determination of the error source and for appropriate service action if and when the system fails to recover from an I/O UE condition.
REFERENCES:
patent: 5193177 (1993-03-01), Burri
patent: 5619642 (1997-04-01), Nielson et al.
patent: 6000040 (1999-12-01), Culley et al.
patent: 6003144 (1999-12-01), Olarig et al.
patent: 6058494 (2000-05-01), Gold et al.
patent: 6105150 (2000-08-01), Noguchi et al.
patent: 6253250 (2001-06-01), Evans et al.
patent: 6574752 (2003-06-01), Ahrens et al.
Bailey Sheldon Ray
Hicks Raymond Leslie
Kitamorn Alongkorn
Baderman Scott
Lohn Joshua
Sawyer Law Group LLP