Global hard error distribution using the SCI interconnect

Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability

Reexamination Certificate

Details

C714S048000

Reexamination Certificate

active

06175931

ABSTRACT:

TECHNICAL FIELD OF THE INVENTION
This invention relates in general to error management in a multi-node, multi-processor system and more particularly to distributing error information throughout such a system on a node-by-node basis.
BACKGROUND OF THE INVENTION
When an error occurs on one node of a Scalable Coherent Interface (SCI) system, it can cause other errors which quickly propagate through the complete system, giving rise to a number of error signals at different nodes, all perhaps stemming from the original error condition. Each of these errors must be cleared before the system is back to full health. However, it can be very difficult to determine which of the many error signals represent the original error and which error signals are derivative therefrom.
An example of this situation is a simple timeout on a memory access at a particular node. The local node detects the timeout error and logs it. In the meantime, a remote node could be attempting to access the same memory location, and it too will log a timeout error. From the perspective of the remote node, it is not easy to tell whether the memory at the target node is failing or whether the link connecting the remote node to the target memory is failing.
The most important thing in debugging this particular error would be to know that the local error was logged first.
This sequencing of errors is not currently possible because errors are logged without any associated time information. An error is just a bit logged in a certain location at a node. Time-stamping each error would be very difficult because it would require synchronizing the clocks on different nodes to a very high accuracy, perhaps even down to nanoseconds. The overhead involved with such a system would be prohibitive.
Thus a need exists in the art for a system and method for isolating errors that occur at one node of a multi-node system but which can cause error conditions to be logged at multiple other nodes.
A further need in the art exists for such a system which does not significantly increase the overhead with respect to such logged errors.
A still further need exists in the art for establishing a system and method for determining the order of occurrence of errors which can occur at multiple nodes as a result of an error condition at one of the nodes.
SUMMARY OF THE INVENTION
Typically, an SCI system has error signals that connect various functional units, such as the PAC, RAC, MAC and TAC. These signals are bidirectional such that, when one functional unit determines an error condition, an error signal is sent to all other functional units. The receiving functional unit then logs that an error has been detected at another functional unit.
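The signalling described above can be illustrated with a minimal Python sketch. It is only an assumption-laden model, not the patented circuitry: the class `FunctionalUnit`, the function `raise_error`, and the idea of recording the source unit's name are all hypothetical, and the text does not say what the receiving unit actually stores beyond the fact that an error occurred elsewhere.

```python
from dataclasses import dataclass, field


@dataclass
class FunctionalUnit:
    """Hypothetical stand-in for one functional unit (PAC, RAC, MAC or TAC)."""
    name: str
    local_error: bool = False
    remote_errors: set = field(default_factory=set)


def raise_error(source, units):
    """Log the error locally, then signal every other unit on the node."""
    source.local_error = True
    for unit in units:
        if unit is not source:
            # The receiver only records that another unit reported an error.
            unit.remote_errors.add(source.name)


units = [FunctionalUnit(n) for n in ("PAC", "RAC", "MAC", "TAC")]
raise_error(units[2], units)  # the MAC detects an error
```

After the call, the MAC holds a local error and the other three units each hold a record that an error was seen at the MAC.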
In one embodiment of the invention, one bit of the SCI Idle Symbol is dedicated to propagating the hard error signal from one node to all other nodes in the system. A command and status register at each node is used to control the propagation of this error signal. The system includes the ability to delay the clock stop feature until the error message has been sent to the next node on each of the SCI rings.
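As a rough illustration of this embodiment, the following Python sketch models a single ring of nodes that pass idle symbols to their successors. Everything named here is an assumption made for the example: the bit position `HARD_ERROR_BIT`, the `Node` class, the `propagate_enabled` flag standing in for a control bit in the command and status register, and the `circulate` helper. The real system has multiple rings and hardware timing that this model ignores.

```python
HARD_ERROR_BIT = 0x1  # hypothetical position of the dedicated bit in an idle symbol


class Node:
    """One node on a ring; all fields are illustrative, not the actual registers."""

    def __init__(self, node_id, propagate_enabled=True):
        self.node_id = node_id
        self.propagate_enabled = propagate_enabled  # stands in for a CSR control bit
        self.hard_error = False
        self.clocks_running = True
        self.events = []

    def detect_hard_error(self):
        self.hard_error = True
        self.events.append("error detected")

    def receive_idle(self, symbol):
        # Log a remote hard error the first time the bit is seen.
        if symbol & HARD_ERROR_BIT and not self.hard_error:
            self.hard_error = True
            self.events.append("remote error logged")

    def emit_idle(self):
        symbol = 0
        if self.hard_error and self.propagate_enabled:
            symbol |= HARD_ERROR_BIT
            self.events.append("error bit sent")
        # Clock stop is deferred until after the idle symbol has gone out.
        if self.hard_error and self.clocks_running:
            self.clocks_running = False
            self.events.append("clocks stopped")
        return symbol


def circulate(ring, start):
    """Pass one idle symbol from each node to its successor, beginning at start."""
    for step in range(len(ring)):
        sender = ring[(start + step) % len(ring)]
        receiver = ring[(start + step + 1) % len(ring)]
        receiver.receive_idle(sender.emit_idle())


ring = [Node(i) for i in range(4)]
ring[1].detect_hard_error()
circulate(ring, start=1)
```

In this model each node's event list shows the error bit leaving the node before its clocks stop, which is the ordering the delayed clock stop is meant to guarantee.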
The clock stop delay feature can be overridden in situations where it is desired to stop the clock before passing on the error message to a next node. This is important in situations where a delay in stopping clocks can cause loss of a critical state in the TAC.
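The trade-off behind this override can be sketched in Python. The model below is hypothetical throughout: the `Tac` class, its integer `state`, the `ticks_to_send` parameter, and the assumption that nothing is forwarded once the clocks have stopped are all invented for illustration and are not taken from the text.

```python
class Tac:
    """Toy TAC whose state changes on every clock tick while clocks run."""

    def __init__(self, state=0):
        self.state = state
        self.clocks_running = True
        self.error_sent = False

    def tick(self):
        if self.clocks_running:
            self.state += 1  # stands in for any state that later cycles overwrite


def handle_hard_error(tac, override_delay, ticks_to_send=3):
    """Return True if the state present at error time is still intact."""
    state_at_error = tac.state
    if override_delay:
        # Stop at once: state is preserved, but this sketch assumes the
        # error is then not forwarded to the next node.
        tac.clocks_running = False
    else:
        # Keep clocking long enough to forward the error, then stop.
        for _ in range(ticks_to_send):
            tac.tick()
        tac.error_sent = True
        tac.clocks_running = False
    return tac.state == state_at_error


immediate = Tac(state=7)
preserved_with_override = handle_hard_error(immediate, override_delay=True)

delayed = Tac(state=7)
preserved_with_delay = handle_hard_error(delayed, override_delay=False)
```

With the override the state captured at the moment of the error survives, at the cost of not forwarding the error in this model. Without it the error is forwarded but the state has moved on.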
The system is also designed to work with error containment cluster control systems to prevent errors from being propagated beyond certain established node clusters.
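Containment of this kind can be pictured with a short Python sketch. The mapping `cluster_of`, the cluster labels, and the function `nodes_reached` are assumptions made for the example. The text does not describe how cluster membership is configured or enforced, so the sketch simply filters by membership.

```python
def nodes_reached(origin, cluster_of):
    """Return the nodes that would see a hard error raised at origin.

    cluster_of maps a node id to a hypothetical containment cluster label.
    Only nodes sharing the origin's cluster receive the error.
    """
    home = cluster_of[origin]
    return {node for node, cluster in cluster_of.items()
            if cluster == home and node != origin}


# Four nodes split into two hypothetical containment clusters.
cluster_of = {0: "A", 1: "A", 2: "B", 3: "B"}
reached_from_1 = nodes_reached(1, cluster_of)
reached_from_3 = nodes_reached(3, cluster_of)
```

Here an error at node 1 reaches only node 0, and an error at node 3 reaches only node 2.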
Accordingly, it is one technical advantage of this invention that a multi-node, multi-processor system is equipped with a central control point at each node for controlling error communication both at that node and across the system.
It is a further technical advantage of this invention that the error message can be passed between nodes with a single bit mapped into a high priority protocol on the SCI internodal link.
It is a still further technical advantage of this invention that clocks at each TAC that would normally be stopped upon determination of an error are allowed to run at least long enough so that the error is passed on to the next node.
It is a still further technical advantage of this invention that under certain conditions the clocks at a TAC will stop immediately to preserve the state which is vital to a determination of the root cause of an error.
It is a still further technical advantage of this invention that under certain conditions error signals will be contained within a defined cluster of nodes.
The foregoing has outlined rather broadly the features and technical advantages of the present invention in order that the detailed description of the invention that follows may be better understood. Additional features and advantages of the invention will be described hereinafter which form the subject of the claims of the invention. It should be appreciated by those skilled in the art that the conception and the specific embodiment disclosed may be readily utilized as a basis for modifying or designing other structures for carrying out the same purposes of the present invention. It should also be realized by those skilled in the art that such equivalent constructions do not depart from the spirit and scope of the invention as set forth in the appended claims.


REFERENCES:
patent: 5331642 (1994-07-01), Valley et al.
patent: 5500944 (1996-03-01), Yoshida
patent: 5542047 (1996-07-01), Armstrong
patent: 5708775 (1998-01-01), Nakamura
patent: 5768501 (1998-06-01), Lewis
patent: 5777549 (1998-07-01), Arrowsmith et al.
patent: 5790779 (1998-08-01), Ben-Nata et al.
patent: 5799015 (1998-08-01), Bennett et al.
