Hierarchy of fault isolation timers

Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability

Reexamination Certificate

Details

C714S048000

active

06618825

ABSTRACT:

BACKGROUND
In computer systems which include a plurality of processors and associated peripheral equipment, a mechanism is generally implemented to track the status of requests and responses transmitted throughout a network or complex of such processors and other equipment. This mechanism is generally implemented so that when a request for data is sent, a CPU (Central Processing Unit) based counter imposes a time limit on satisfaction of the request. If the request is not satisfied within the time limit, the counter generally times out, thereby triggering a high priority machine check which causes the computer network or complex to shut down and initiates execution of recovery code by the CPU. Generally, the counter starts timing when a request for a transaction wins arbitration and is placed on a bus toward a designated destination within the network or complex. The counter will generally stop either upon successful completion of the task being timed by the counter or upon expiration of the designated time period.
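The counter behavior described above can be illustrated with a minimal sketch. It assumes a simulated tick-based clock; the names `TransactionCounter` and `run_request` and the outcome strings are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TransactionCounter:
    """Hypothetical tick-based counter timing one outstanding request."""
    limit_ticks: int
    elapsed: int = 0
    running: bool = False
    timed_out: bool = False

    def start(self) -> None:
        # Called when the request wins arbitration and is placed on the bus.
        self.elapsed = 0
        self.running = True
        self.timed_out = False

    def tick(self) -> None:
        # Advance one simulated clock tick; expire once the limit is reached.
        if not self.running:
            return
        self.elapsed += 1
        if self.elapsed >= self.limit_ticks:
            self.running = False
            self.timed_out = True

    def complete(self) -> None:
        # Called when the timed task finishes before the limit expires.
        self.running = False


def run_request(counter: TransactionCounter,
                response_at_tick: Optional[int]) -> str:
    """Simulate one request; response_at_tick=None means no response arrives."""
    counter.start()
    tick = 0
    while counter.running:
        tick += 1
        if response_at_tick is not None and tick == response_at_tick:
            counter.complete()
            break
        counter.tick()
    # In the prior art described, expiry triggers a high priority machine check.
    return "machine_check" if counter.timed_out else "completed"
```

In this sketch the counter stops in exactly the two ways the text names: successful completion of the timed task, or expiration of the designated period.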
One problem with the above time-out mechanism is that a time out condition generally forces an entire network or complex of connected CPUs to crash and lose all data associated with the system state in existence prior to the crash. In this situation, system recovery may be accomplished only at a very basic stage with much valuable data having been irretrievably lost due to the time out condition. Furthermore, the centralization of timer operation in the CPU leaves little data with which to identify a source or cause of the error which caused the time out condition. Accordingly, re-occurrence of the event causing the time out may be difficult to prevent.
In another prior art system, timers are added to various system chips in communication with CPUs within a complex of CPUs. Generally, when a time-out condition occurs in such a system chip, a state of the system chip which timed out, at the point in time when the time-out occurred, may be obtained, thereby providing information which may help to identify the cause of the failed transaction leading to the time-out condition. This approach generally provides more guidance in debugging a failure leading to a time out condition than systems employing only CPU-based timers. However, even with implementation of system chip based timers, a time-out condition will generally cause the entire network of computers to crash and lose all information associated with the machine state in existence just prior to the time-out condition. Accordingly, only a very basic recovery operation is available. And, as was the case with the previously discussed CPU-based timer approach, much data is irretrievably lost upon occurrence of a time-out condition.
Another prior art approach involves implementation of a scalable coherent interface (SCI). SCI includes a networking protocol for retrying certain transactions upon expiration of timers associated with timed transactions. Therefore, instead of crashing the system upon timing out a first time, the SCI protocol may be employed to retry transmission of a request for which a response was not received in a timely manner. Thus, when a counter times out, the counter may be initialized to zero, the associated request re-transmitted, and the timer enabled to time the retried transaction. This approach may enable certain time out conditions to be avoided where a failure was caused by transient effects within the overall network or complex which do not reoccur during a retried transaction. However, certain problems associated with earlier mentioned approaches remain. Specifically, upon occurrence of a final time-out (for a transaction which will not be retried), the system will generally crash, and data associated with the machine state prior to the crash will generally be irretrievably lost. Accordingly, only a very basic recovery operation will be available.
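The retry-on-expiry idea can be sketched as follows. This is an illustration under stated assumptions, not the SCI protocol itself: the function name `send_with_retry`, the `max_retries` parameter, and the representation of each attempt as a response latency in ticks are all hypothetical.

```python
from typing import Optional, Sequence, Tuple


def send_with_retry(latencies: Sequence[Optional[int]],
                    timeout_ticks: int,
                    max_retries: int) -> Tuple[str, int]:
    """Simulate a timed request that is retried when its timer expires.

    latencies[i] is the number of ticks until a response arrives on
    attempt i, or None if no response ever arrives on that attempt.
    Returns (outcome, attempts_used).
    """
    attempts = 0
    # One initial attempt plus up to max_retries re-transmissions.
    for latency in latencies[: max_retries + 1]:
        attempts += 1
        # Each attempt is timed from zero, i.e. the counter is re-initialized.
        if latency is not None and latency < timeout_ticks:
            return "completed", attempts
    # No further retry is allowed: this is the final time-out case.
    return "final_timeout", attempts
```

A transient fault that clears before the retry yields `"completed"`; a persistent fault still ends in the final time-out that the text identifies as the remaining problem.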
The timing mechanisms employed in the prior art are generally neither synchronized nor coordinated with each other. Furthermore, the timing mechanisms are generally thinly scattered over a large number of devices, whether the timers are located exclusively in CPUs or are located in a combination of CPUs and system chips, such as memory and input/output (I/O) controllers. Accordingly, a fault in an area of a computer network or complex may go undetected until the problem is substantial enough to cause a widespread system shutdown.
In prior art systems employing timers distributed among various system chips, the timers generally have closely spaced time-out values. Accordingly, when a fault is encountered, a plurality of different timers may time out asynchronously in close temporal proximity to each other, thereby causing the overall system to crash and making subsequent identification of the problem leading to the system crash very difficult. It is further noted that in the prior art systems described above, a time out condition in one CPU or in one system chip may cause an entire complex of CPUs and associated system chips to fail or crash, thereby enabling a failure in 1% of a complex to disrupt operation of 100% of the complex.
Therefore, it is a problem in the art that the machine state of a computer system is lost upon occurrence of a time out condition.
It is a further problem in the art that only a very limited recovery operation is possible after occurrence of a time-out condition.
It is a still further problem in the art that identifying the timer whose expiration caused a system crash may be very difficult in the systems of the prior art.
It is a still further problem in the art that a transaction failure and associated time out condition in one chip of a complex may cause the entire complex to crash or fail.
SUMMARY OF THE INVENTION
These and other objects, features and technical advantages are achieved by a system and method which deploys timers within devices in a distributed manner throughout a system or complex which includes CPUs and associated system chips, where the timers have a hierarchy of time-out values, and where the timers are able to independently experience time-out conditions generating a localized failure condition while enabling a remainder of the complex to continue operating. Preferably, a chip, device, or sub-system affected by the time-out or other error condition continues operating in a degraded or safety mode and communicates its condition to other chips and sub-systems so that the rest of the complex may continue operating while preferably bypassing the chip, device, or sub-system affected by the time-out condition.
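The localized-failure behavior described above might look like the following sketch. The `Device` class, its `Mode` states, and the notification scheme (directly updating each peer's set of known-degraded names) are hypothetical simplifications chosen for illustration; the patent text does not specify them.

```python
from enum import Enum
from typing import Iterable, Optional, Set


class Mode(Enum):
    NORMAL = "normal"
    DEGRADED = "degraded"


class Device:
    """Hypothetical chip or sub-system holding its own local timer."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.mode = Mode.NORMAL
        self.known_degraded: Set[str] = set()  # peers reported as degraded

    def on_local_timeout(self, peers: Iterable["Device"]) -> None:
        # Localized failure: keep running in degraded mode and report the
        # condition to peers instead of bringing down the whole complex.
        self.mode = Mode.DEGRADED
        for peer in peers:
            peer.known_degraded.add(self.name)

    def pick_target(self, candidates: Iterable["Device"]) -> Optional["Device"]:
        # Bypass any candidate that has reported itself as degraded.
        for candidate in candidates:
            if candidate.name not in self.known_degraded:
                return candidate
        return None
```

For example, after `b.on_local_timeout([a, c])`, device `a` remains in normal mode and `a.pick_target([b, c])` returns `c`, so traffic is routed around the affected device.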
The various timing operations preferably operate within a coordinated hierarchical structure wherein each timer monitors an operation occurring below its own level in a hierarchy while also being monitored by a device (which may be a timer) at a higher level in the hierarchy, where the higher level device (whether timer, CPU or other device) is generally able to monitor the timers below its level for a time period exceeding the time-out value of the timer being so monitored. In this manner, a time-out condition of a timer at one level in the hierarchy may be detected at the next higher level in the hierarchy thereby enabling the higher level device to respond to a time-out condition in a pre-determined and controlled manner, thereby enabling the higher level device to preserve its own data, preserve control over its own operation and beneficially isolate the error condition to the lower level device or system, thereby avoiding a shutdown of an entire complex or system.
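The ordering constraint in this hierarchy, where each higher level monitors for longer than the time-out value of the level below it, can be checked with a short sketch. The `TimerNode` structure, the function names, and the example time-out values are hypothetical; `first_to_expire` additionally assumes that all timers on the path started at the same moment.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple


@dataclass
class TimerNode:
    """Hypothetical timer at one level of the hierarchy."""
    name: str
    timeout: int
    children: List["TimerNode"] = field(default_factory=list)


def hierarchy_violations(node: TimerNode) -> List[Tuple[str, str]]:
    """Return (parent, child) pairs where the parent's time-out does not
    exceed the child's, so the parent could not observe the child expiring."""
    bad: List[Tuple[str, str]] = []
    for child in node.children:
        if node.timeout <= child.timeout:
            bad.append((node.name, child.name))
        bad.extend(hierarchy_violations(child))
    return bad


def first_to_expire(path: Sequence[TimerNode]) -> TimerNode:
    """Given the timers from the top level down to a stalled operation,
    all assumed to start together, return the one that expires first."""
    return min(path, key=lambda n: n.timeout)


# Example values are illustrative only.
io_ctl = TimerNode("io_ctl", timeout=10)
mem_ctl = TimerNode("mem_ctl", timeout=20, children=[io_ctl])
cpu = TimerNode("cpu", timeout=40, children=[mem_ctl])
```

With a valid hierarchy, `hierarchy_violations(cpu)` is empty and `first_to_expire([cpu, mem_ctl, io_ctl])` is the lowest-level timer, so the fault is reported at the lowest level while each higher level still has time remaining to respond in a controlled manner.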
Since the equipment affected by a time-out condition preferably continues operating, albeit in a degraded mode, during the time-out, and the rest of the complex may continue operating substantially normally, the complex is preferably able to preserve system state information which existed prior to the time-out condition and to continue processing information associated with the system state. Moreover, since the chip or device affected by the time-out continues operating after the time-out, and is able t

Profile ID: LFUS-PAI-O-3020949
