Reexamination Certificate
2000-04-27
2003-11-04
Beausoliel, Robert (Department: 2184)
Error detection/correction and fault detection/recovery
Data processing system error or fault handling
Reliability and availability
C714S031000
active
06643802
ABSTRACT:
BACKGROUND
The invention relates to storing information in response to a fault occurring in a parallel processing system.
Software in a computer system may be made up of many layers. The highest layer is usually referred to as the application layer, followed by lower layers that include the operating system, device drivers (which usually are part of the operating system), and other layers. In a system that is coupled to a network, various transport and network layers may also be present.
During execution of various software routines or modules in the several layers of a system, errors or faults may occur. Such faults may include addressing exceptions, arithmetic faults, and other system errors. A fault handling mechanism is needed to handle such faults so that a software routine or module or even the system can shut down gracefully. For example, clean-up operations may be performed by the fault handling mechanism, and may include the deletion of temporary files and freeing up of system resources. In many operating systems, exception handlers are provided to handle various types of faults (or exceptions). For example, exception handlers are provided in WINDOWS® operating systems and in UNIX operating systems.
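The clean-up behavior described above can be illustrated with a small Python sketch. It is not taken from the patent: the function name `run_with_cleanup`, the status strings, and the mapping of Python's `ArithmeticError` and `LookupError` onto "arithmetic faults" and "addressing exceptions" are assumptions made for illustration only.

```python
import os
import tempfile


def run_with_cleanup(work):
    """Run work(path) against a temporary file and always clean up.

    Hypothetical helper: 'work' is any callable that takes the path of a
    scratch file. The handlers below stand in for an operating system's
    exception handlers.
    """
    fd, path = tempfile.mkstemp(prefix="job-")
    os.close(fd)  # free the file descriptor (a system resource) right away
    try:
        return ("ok", work(path))
    except ArithmeticError as exc:
        # Arithmetic fault, e.g. division by zero or overflow.
        return ("arithmetic-fault", type(exc).__name__)
    except LookupError as exc:
        # Loose analogue of an addressing exception: a bad index or key.
        return ("addressing-fault", type(exc).__name__)
    finally:
        # Clean-up runs whether or not a fault occurred: delete the temp file.
        if os.path.exists(path):
            os.remove(path)
```

Calling `run_with_cleanup(lambda p: 1 // 0)` returns a status tuple instead of crashing, and the temporary file is gone afterwards, which is the "graceful shutdown" idea in miniature.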
Software may be run on single processor systems, multiprocessor systems, or multi-node parallel processing systems. Examples of single processor systems include standard desktop or portable systems. A multiprocessor system may include a single node that includes multiple processors running in the node. Such systems may include symmetric multiprocessor (SMP) systems. A multi-node parallel processing system may include multiple nodes that may be connected by an interconnect network.
Faults may occur during execution of software routines or modules in each node of a multi-node parallel processing system. When a fault occurs in a multi-node parallel processing system, it may be desirable to capture the state of each node in the system. A need thus exists for a method and apparatus for coordinating the handling of faults occurring in a system having multiple nodes.
SUMMARY
In general, according to one embodiment, a method of handling faults in a system having plural nodes includes detecting a fault condition in the system and starting a fault handling routine in each of the nodes. Selected information collected by each of the fault handling routines is communicated to a predetermined one of the plural nodes.
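The method summarized above can be sketched as a single-process Python simulation. This is a rough illustration under stated assumptions, not the patented implementation: the `Node` class, the `handle_fault` function, the `selected_keys` parameter, and the choice of node 0 as the predetermined collector are all hypothetical, and a real system would exchange messages over the interconnect network rather than make direct method calls.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    state: dict                                     # local state of this node
    collected: dict = field(default_factory=dict)   # filled only on the collector

    def fault_handler(self, fault, origin, selected_keys):
        """Gather only the selected pieces of local state for the dump."""
        info = {k: self.state[k] for k in selected_keys if k in self.state}
        info["fault"] = fault      # what went wrong
        info["origin"] = origin    # which node detected it
        return info


def handle_fault(nodes, faulting_id, fault, collector_id=0,
                 selected_keys=("pid", "stack")):
    """Start a fault handler on every node and gather results on one node."""
    # The predetermined node that receives every node's selected information.
    collector = next(n for n in nodes if n.node_id == collector_id)
    for node in nodes:
        # Step 1: start the fault handling routine in each node.
        info = node.fault_handler(fault, faulting_id, selected_keys)
        # Step 2: communicate the selected information to the collector.
        collector.collected[node.node_id] = info
    return collector.collected
```

For example, with three nodes and a fault detected on node 2, `handle_fault(nodes, faulting_id=2, fault="arithmetic")` leaves one entry per node on node 0, each holding only the selected keys plus the fault description.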
Other features and embodiments will become apparent from the following description, from the drawings, and from the claims.
REFERENCES:
patent: 5046068 (1991-09-01), Kubo et al.
patent: 5056091 (1991-10-01), Hunt
patent: 5253359 (1993-10-01), Spix et al.
patent: 5303383 (1994-04-01), Neches et al.
patent: 5371883 (1994-12-01), Gross et al.
patent: 5485573 (1996-01-01), Tandon
patent: 5537535 (1996-07-01), Maruyama et al.
patent: 5619644 (1997-04-01), Crockett et al.
patent: 5640584 (1997-06-01), Kandasamy et al.
patent: 5642478 (1997-06-01), Chen et al.
patent: 5664093 (1997-09-01), Barnett et al.
patent: 5699505 (1997-12-01), Srinivasan
patent: 5774645 (1998-06-01), Beaujard et al.
patent: 5845062 (1998-12-01), Branton et al.
patent: 5872904 (1999-02-01), McMillen et al.
patent: 5884019 (1999-03-01), Inaho
patent: 5961642 (1999-10-01), Lewis
patent: 6000040 (1999-12-01), Culley et al.
patent: 6000046 (1999-12-01), Passmore
patent: 6065136 (2000-05-01), Kuwabara
patent: 6105150 (2000-08-01), Noguchi et al.
patent: 6289379 (2001-09-01), Urano et al.
patent: 6430712 (2002-08-01), Lewis
patent: 6470388 (2002-10-01), Niemi et al.
Calkins Dennis R.
Cochran Nancy J.
Frost Bruce J.
Geisert Mark A.
Hsieh Carl Chih-Fen
Beausoliel Robert
Chu Gabriel
NCR Corporation
Trop, Pruner & Hu P. C.
Coordinated multinode dump collection in response to a fault