Reexamination Certificate
2000-05-05
2004-01-06
Iqbal, Nadeem (Department: 2184)
Error detection/correction and fault detection/recovery
Data processing system error or fault handling
Reliability and availability
C714S002000, C714S003000
active
06675315
ABSTRACT:
FIELD OF THE INVENTION
The present invention relates generally to distributed computing systems and, more specifically, to preserving data for diagnosing crashes in such systems.
BACKGROUND OF THE INVENTION
A crash in a computer system is a serious failure in which the computer stops working or a computer program aborts unexpectedly. A crash signifies either a hardware or a software malfunction. Exemplary causes of system crashes include memory access violations, bad pointers, and violations of assertion conditions in a program. Effectively diagnosing a crash is complex, and this complexity is exacerbated in distributed systems in which multiple nodes participate in an operation. This is because, in distributed systems, multiple nodes interface with each other, and a crash on a particular node does not necessarily mean that the cause of the crash originated on that node. The cause of the crash may be, for example, a message that was transmitted to the crashed node and that subsequently caused the crash. In various cases, the sequence of events leading to the crash may be spread across numerous nodes. Further, because only one of the multiple nodes crashes, the non-crashed nodes continue to function and thus change the overall state of the system, which makes it more difficult to identify the causes of the crash.
Currently, when a system crashes, diagnostic programs typically perform a “core dump,” which provides information to be analyzed as to the cause of the crash. Such information reflects the system state of the crashed node at the time of the crash, such as memory addresses and program counters. However, because other nodes interfacing with the crashed node are still functioning, the state of the non-crashed nodes continues to change. Having data from the crashed node is useful but, in many cases, is not sufficient for identifying the cause of the crash.
Based on the foregoing, it is clearly desirable to provide better techniques for diagnosing crashes in systems in which multiple nodes participate in operations.
SUMMARY OF THE INVENTION
Mechanisms are provided for preserving state information in response to errors that occur in operations in which multiple nodes are participating. In one embodiment, when an error occurs, one or more execution units are suspended. These execution units may be on the node on which the error occurred (the “error node”) and/or on other non-error nodes. In this context, the term “execution unit” refers to a program that executes a particular task. State information is collected from both the suspended execution units and the error node in which the error occurred. All suspended execution units are then released, i.e., allowed to continue execution at the point where the units were suspended. The data collected during suspension is then used for diagnosing the error.
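The suspend, collect, and release sequence described above can be illustrated with a minimal in-process sketch. It is not the patented implementation: the `ExecutionUnit` class, its `counter` field, and the `preserve_state` function are hypothetical names chosen for illustration, and a real distributed system would suspend units across nodes rather than toggle a flag in one process.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ExecutionUnit:
    """Hypothetical stand-in for a program that executes a particular task."""
    name: str
    node: str
    counter: int = 0          # toy "state" that changes while the unit runs
    suspended: bool = False

    def step(self) -> None:
        # A suspended unit makes no progress, so its state stays frozen.
        if not self.suspended:
            self.counter += 1

    def snapshot(self) -> Dict[str, object]:
        # Copy the current state so later execution cannot alter the record.
        return {"unit": self.name, "node": self.node, "counter": self.counter}


def preserve_state(units: List[ExecutionUnit], error_node: str) -> List[Dict[str, object]]:
    """Suspend the given units, collect their state, then release them."""
    for unit in units:
        unit.suspended = True
    try:
        collected = [unit.snapshot() for unit in units]
        # Assumption: the error node contributes its own record as well.
        collected.append({"error_node": error_node})
    finally:
        # Release: units continue from the point where they were suspended.
        for unit in units:
            unit.suspended = False
    return collected
```

Because each snapshot is a copy, the collected data stays fixed after the units are released, which is what would allow it to be examined later without holding the system still.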
According to one embodiment, the type of error event dictates which execution units are to be suspended and the type of information to be collected from the suspended execution units.
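One way to picture this dependence on error type is a lookup table that maps each error type to a suspension scope and a list of items to collect. The sketch below is an assumption made for illustration: the error names, the scope values, the collected fields, and the `units_to_suspend` helper are all hypothetical and do not come from the patent.

```python
from typing import Dict, List, NamedTuple


class Policy(NamedTuple):
    suspend_scope: str        # "error_node" or "all_nodes"
    collect: List[str]        # kinds of state to gather from suspended units


# Hypothetical policy table; the error names and fields are illustrative only.
POLICIES: Dict[str, Policy] = {
    "assertion_failure": Policy("error_node", ["call_stack", "local_variables"]),
    "bad_message": Policy("all_nodes", ["message_queues", "call_stack"]),
}

# Fallback used when an error type has no entry in the table.
DEFAULT_POLICY = Policy("error_node", ["call_stack"])


def units_to_suspend(
    error_type: str,
    error_node: str,
    units_by_node: Dict[str, List[str]],
) -> List[str]:
    """Return the names of the execution units the policy says to suspend."""
    policy = POLICIES.get(error_type, DEFAULT_POLICY)
    if policy.suspend_scope == "all_nodes":
        # Visit nodes in sorted order so the result is deterministic.
        return [u for node in sorted(units_by_node) for u in units_by_node[node]]
    return list(units_by_node.get(error_node, []))
```

Keeping the policy in data rather than in branching code means a new error type can be supported by adding a table entry.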
In accordance with various embodiments of the invention, suspension of execution units provides a window of opportunity to collect all relevant information necessary for identifying causes of a crash. Further, the collected data are analyzed “off-line,” without affecting usage of the involved system.
Granat Yuriy S.
Lam Ivan Tinlung
Semler Daniel
Srivastava Alok
Bonura Timothy M.
Iqbal Nadeem
Oracle International Corp.