Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability
Reexamination Certificate
2000-05-02
2001-02-20
Beausoliel, Jr., Robert W. (Department: 2785)
active
06192489
ABSTRACT:
BACKGROUND OF THE INVENTION
A. Field of the Invention
The present invention relates to error recovery in computer systems. More particularly, the present invention relates to recovery from processing errors caused by AC or timing dependent defects.
B. Related Art
The Unscheduled Incident Repair Action (UIRA) is perhaps the single most important Reliability, Availability and Serviceability (RAS) characteristic. UIRAs are caused by a non-recoverable failure in a critical hardware function, which results in the need to bring a customer's system down for repair at an unscheduled time. Circuit failures causing UIRAs can be either AC or DC in nature. DC defects are solid failures which occur whenever a defective circuit is used. AC defects are typically timing dependent and show up only when a timing margin in a logic path is exceeded.
Self-test mechanisms that can distinguish AC defects from DC defects are known in the art. For example, in cases where logic fails a self-test at a first clock speed, it is known in the art to rerun the self-test at a lower clock speed to determine whether the failure was caused by an AC defect or a DC defect. If the self-test passes at the lower clock speed, the failure is identified as having been caused by an AC defect. If the self-test does not pass at the lower clock speed, the failure is identified as having been caused by a DC defect. An article entitled “SELF-TEST AC ISOLATION” (IBM Technical Disclosure Bulletin, Vol. 28, No. 1, June 1985, pp. 49-51) describes a method to identify the initiating clock pulse of an AC failure, to identify the capturing clock pulse, to identify the capturing storage elements, and to extract the hardware states just prior to and just after the failure for further diagnosis.
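The two-speed classification described above can be sketched as follows. This is a minimal illustration and not the method of any cited reference: the names classify_defect and make_path_test, and the model of a self-test as a single path delay compared against the cycle time, are assumptions made for the example.

```python
from enum import Enum
from typing import Callable


class DefectKind(Enum):
    NONE = "no failure"
    AC = "timing dependent"
    DC = "solid"


def classify_defect(run_self_test: Callable[[float], bool],
                    normal_mhz: float,
                    reduced_mhz: float) -> DefectKind:
    """Rerun a failing self-test at a lower clock speed to classify the defect."""
    if run_self_test(normal_mhz):
        return DefectKind.NONE          # passed at the first clock speed
    if run_self_test(reduced_mhz):
        return DefectKind.AC            # fails fast, passes slow
    return DefectKind.DC                # fails at both speeds


def make_path_test(path_delay_ns: float) -> Callable[[float], bool]:
    """Hypothetical self-test: passes when the path delay fits in one cycle."""
    def self_test(clock_mhz: float) -> bool:
        cycle_ns = 1000.0 / clock_mhz
        return path_delay_ns <= cycle_ns
    return self_test


if __name__ == "__main__":
    # An 11 ns path fails at 100 MHz (10 ns cycle) but fits at 80 MHz (12.5 ns).
    print(classify_defect(make_path_test(11.0), 100.0, 80.0))
    # A test that fails regardless of clock speed models a solid failure.
    print(classify_defect(lambda mhz: False, 100.0, 80.0))
```

The callback receives only the clock speed, so the same classifier can wrap any self-test that can be rerun at a chosen frequency.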
While the above test methods provide a means for distinguishing AC defects from DC defects and for fault isolation within a test fixture environment, they do not solve the problem of providing dynamic error recovery or fault tolerance from processing errors caused by AC defects.
Prior art computer systems have been provided with a variety of mechanisms for recovering from processing errors. For example, U.S. Pat. No. 4,912,707 to Kogge et al. discloses the use of a checkpoint retry mechanism which enables the retry of instruction sequences for segments of recently executed code, in response to detection of an error since the passage of a current checkpoint. Another example of an instruction retry mechanism is disclosed in U.S. Pat. No. 4,044,337 to Hicks et al.
While such prior art retry mechanisms provide a good means for recovery from soft errors (errors occurring because of electrical noise or other randomly occurring sources which result in non-reproducible fault syndromes), they do not provide recovery from solid or hard errors caused by AC defects (i.e. timing errors which are recurring and consistently reproducible).
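The difference between the two error classes under a retry-only scheme can be illustrated with a small sketch. It is a generic checkpoint and retry loop written for this example, not the mechanism of either cited patent. ProcessingError, run_with_retry and the two fault models are hypothetical names.

```python
from typing import Callable, Dict, Tuple


class ProcessingError(Exception):
    """Raised by the (hypothetical) error checkers during a code segment."""


def run_with_retry(segment: Callable[[Dict[str, int]], None],
                   checkpoint: Dict[str, int],
                   max_retries: int = 3) -> Tuple[bool, Dict[str, int], int]:
    """Run `segment`; on error, restore the checkpoint and retry.

    Returns (succeeded, resulting state, number of attempts made).
    """
    attempts = 0
    while attempts <= max_retries:
        state = dict(checkpoint)        # restore the state saved at the checkpoint
        attempts += 1
        try:
            segment(state)
            return True, state, attempts
        except ProcessingError:
            continue                    # discard partial results and retry
    return False, dict(checkpoint), attempts


def soft_error_segment(fail_first_n: int) -> Callable[[Dict[str, int]], None]:
    """Segment hit by a transient fault on its first `fail_first_n` runs."""
    remaining = [fail_first_n]

    def segment(state: Dict[str, int]) -> None:
        if remaining[0] > 0:
            remaining[0] -= 1
            raise ProcessingError("transient fault")
        state["acc"] += 1
    return segment


def recurring_error_segment(state: Dict[str, int]) -> None:
    """Segment that fails on every run, like a reproducible timing error."""
    raise ProcessingError("recurring fault")
```

With these models, a segment hit once by a transient fault completes on the second attempt, while the recurring fault exhausts every retry because nothing about the operating conditions changes between attempts.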
Another prior art mechanism for handling processing errors involves the use of redundant processing elements. In such systems, identical instruction streams are processed in parallel by two or more processing elements. When an unrecoverable error is detected in one of the processing elements, it is taken off-line and the other processing element continues to process the instruction stream. One advantage of such redundant processor schemes is that they can handle both “soft” and “solid” or “hard” errors. The disadvantage of such schemes is that providing duplicate processing elements to increase “fault tolerance” significantly increases the cost of the system in terms of parts and manufacture.
Thus, what is needed is an inexpensive mechanism to enable an otherwise conventional computer system to dynamically recover from AC defects.
SUMMARY OF THE INVENTION
The present invention comprises a mechanism for handling processing errors caused by AC defects in a computer system. The mechanism includes first means for processing a stream of instructions; second means for detecting a timing dependent error occurring during processing of the instruction by the first means; and third means for varying the instruction processing cycle time of the first means in response to the detection of a timing dependent error by the second means, and for causing the second means to retry at least a portion of the instruction subsequent to the varying.
In a preferred embodiment, the present invention uses a variable frequency oscillator, controlled by recovery code, to increase the system clock cycle time by a specified time (Textend) following what has been determined to be a critical fail and after normal retry has been unsuccessful. The increased cycle time extends the logic path timing slack and, thereby, provides tolerance to certain AC (path delay) defects which have developed in any cycle time dependent latch to latch segment. The time (Textend) is chosen based on maximum cycle time restrictions resulting, for example, from the pipelining of data in system cables.
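The effect of the extension on timing slack can be worked through with a short calculation. The sketch below assumes the simple model slack = cycle time minus path delay. The function names and all numeric values are illustrative and do not come from the source.

```python
def extended_cycle_ps(nominal_ps: int, t_extend_ps: int, max_cycle_ps: int) -> int:
    """Return the nominal cycle time increased by Textend.

    Raises ValueError if the result would exceed the maximum cycle time
    (for example, a limit imposed by data pipelined in system cables).
    """
    if t_extend_ps < 0:
        raise ValueError("Textend must not be negative")
    extended = nominal_ps + t_extend_ps
    if extended > max_cycle_ps:
        raise ValueError("extended cycle time exceeds the maximum cycle time")
    return extended


def slack_ps(cycle_ps: int, path_delay_ps: int) -> int:
    """Timing slack of a latch to latch path; negative means a timing failure."""
    return cycle_ps - path_delay_ps


# Illustrative numbers only (not from the source), in picoseconds.
NOMINAL_PS = 10_000
T_EXTEND_PS = 1_000
MAX_CYCLE_PS = 12_000
DEGRADED_PATH_PS = 10_400   # a path whose delay has grown past the nominal cycle

slack_before = slack_ps(NOMINAL_PS, DEGRADED_PATH_PS)
slack_after = slack_ps(extended_cycle_ps(NOMINAL_PS, T_EXTEND_PS, MAX_CYCLE_PS),
                       DEGRADED_PATH_PS)
print(slack_before, slack_after)   # prints: -400 600
```

In this example the degraded path misses the nominal cycle by 400 ps and regains 600 ps of slack once the cycle is extended, while the extended cycle stays under the assumed maximum.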
Successful retry at increased (extended) cycle time means that the defect was time dependent and tolerated by the cycle time extension (Textend). A successful retry still results in a service call for deferred repair, but the system can remain up and running. Unsuccessful retry at increased cycle time means that the defect was solid (DC), or AC with a timing characteristic longer than the cycle time extension (Textend). In such instances, an unsuccessful retry results in a UIRA which brings the system down and initiates a service call for immediate repair.
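The recovery sequence and its three possible outcomes can be summarized in a short sketch. It is a simplified model under stated assumptions, not the recovery code of the preferred embodiment: the retry callback, the Outcome names and the count of normal retries are hypothetical, and the oscillator is reduced to a cycle time value passed to the callback.

```python
from enum import Enum
from typing import Callable, Tuple


class Outcome(Enum):
    RECOVERED = "normal retry succeeded"
    DEFERRED_REPAIR = "tolerated by Textend; service call for deferred repair"
    UIRA = "not tolerated; system down, immediate repair"


def recover(retry: Callable[[int], bool],
            nominal_ps: int,
            t_extend_ps: int,
            normal_retries: int = 2) -> Tuple[Outcome, int]:
    """Try normal retry first, then one retry at the extended cycle time.

    `retry(cycle_ps)` is a hypothetical callback that re-executes the failing
    operation at the given cycle time and returns True on success.
    Returns the outcome and the cycle time in effect afterwards.
    """
    for _ in range(normal_retries):
        if retry(nominal_ps):
            return Outcome.RECOVERED, nominal_ps
    extended_ps = nominal_ps + t_extend_ps      # stands in for the oscillator change
    if retry(extended_ps):
        return Outcome.DEFERRED_REPAIR, extended_ps
    return Outcome.UIRA, extended_ps


def path_retry(path_delay_ps: int) -> Callable[[int], bool]:
    """Hypothetical fault model: retry passes when the path fits in the cycle."""
    return lambda cycle_ps: path_delay_ps <= cycle_ps


if __name__ == "__main__":
    print(recover(path_retry(10_400), 10_000, 1_000))   # delay within the extension
    print(recover(path_retry(12_000), 10_000, 1_000))   # delay beyond the extension
    print(recover(lambda cycle_ps: False, 10_000, 1_000))  # solid (DC) failure
```

A failed extended retry does not distinguish a solid defect from an AC defect whose delay exceeds the extension, which is why both cases map to the same UIRA outcome in the sketch.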
The present invention can be extended to provide data in an error reporting file which can be used to assist manufacturing/repair in defect analysis of the failing hardware. Often, the testing of liquid cooled modules (TCMs) returned from field repair results in a report of “No Defect Found” (NDF). NDFs can be caused by AC defects in TCM to TCM nets which, because of circuit timings, only appear when a failing unit is in place in a customer's machine. Having data in the repair message which identifies that the defect is time dependent and tolerated by the cycle time extension (Textend) can assist in defect isolation and identification.
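One way such data might be recorded is sketched below. The source does not specify a file format, so the record layout, every field name and the use of JSON are assumptions made purely for illustration.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class RepairRecord:
    """Hypothetical error report entry; all field names are assumptions."""
    failing_unit: str
    time_dependent: Optional[bool]   # None when the retry could not establish it
    tolerated_by_extension: bool
    t_extend_ps: int
    repair_action: str


def build_repair_record(failing_unit: str,
                        extended_retry_ok: bool,
                        t_extend_ps: int) -> RepairRecord:
    """Summarize the result of the extended cycle time retry for repair staff."""
    return RepairRecord(
        failing_unit=failing_unit,
        # A failed extended retry may be a solid defect or an AC defect whose
        # delay exceeds Textend, so time dependence is then left unknown.
        time_dependent=True if extended_retry_ok else None,
        tolerated_by_extension=extended_retry_ok,
        t_extend_ps=t_extend_ps,
        repair_action="deferred" if extended_retry_ok else "immediate",
    )


def to_report_line(record: RepairRecord) -> str:
    """Serialize one record as a single JSON line for the error reporting file."""
    return json.dumps(asdict(record), sort_keys=True)


if __name__ == "__main__":
    print(to_report_line(build_repair_record("TCM-A", True, 1_000)))
```

A record flagged as time dependent and tolerated by the extension tells the repair site that a module passing standalone tests may still carry an AC defect in a module to module net.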
REFERENCES:
patent: 3548177 (1970-12-01), Haitlipp et al.
patent: 3868647 (1975-02-01), Zandveld
patent: 4003086 (1977-01-01), Larsen et al.
patent: 4025768 (1977-05-01), Missios et al.
patent: 4044337 (1977-08-01), Hicks et al.
patent: 4412281 (1983-10-01), Works
patent: 4481575 (1984-11-01), Bazlen et al.
patent: 4800564 (1989-01-01), DeFazio et al.
patent: 4912707 (1990-03-01), Kogge et al.
patent: 5872907 (1999-02-01), Griess et al.
IBM Technical Disclosure Bulletin, vol. 29, No. 2, Jul. 1986, pp. 903-904; “Clock Recovery . . . Counter”.
IBM Technical Disclosure Bulletin, vol. 28, No. 1, Jun. 1985, pp. 49-51; “Self Test AC Isolation”.
IBM Technical Disclosure Bulletin, vol. 27, No. 4B, Sep. 1984, pp. 2509-2510; “High Speed Programmable Clock Generator”.
IEEE Spectrum, Feb. 1984, pp. 36-42; “Maintenance processors for Mainframe Computers” by T. Liu.
IBM Technical Disclosure Bulletin, vol. 21, No. 4, Sep. 1978; “Retry with Performance Degradation”.
Griess Kevin Roy
Merenda Ann Caroline
Pierce Donald Lloyd
Baderman Scott T.
Beausoliel, Jr. Robert W.
Cutter, Esq. Lawrence D.
International Business Machines Corporation
McGuireWoods LLP