Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability
Reexamination Certificate
2000-06-29
2004-02-10
Beausoliel, Robert (Department: 2184)
C714S026000, C714S047300, C714S048000, C714S015000, C714S037000
active
06691250
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates generally to computer system software for handling faults, resulting from logic and coding errors, corrupted states in memory, and other hardware failures, that can cause a computer system to crash. More specifically, the invention relates to a virtual machine used for the diagnosis of and recovery from such faults.
2. Discussion of Related Art
Ever since computers came into use in commercial and non-commercial settings on any scale, devising fault-tolerant computer systems has been an important and constantly evolving area of computer science. As computers are used more and more in environments where failures must be avoided as much as possible, fault-tolerant systems have been further developed to best handle unexpected system failures. In current fault-tolerant systems, fault diagnosis and fault recovery have generally been separated or isolated from each other. Determining that a fault occurred because of a logic error or a corrupted memory state is a distinct task from actually recovering the system to a normal state so processing can continue. At one end of the spectrum of fault-tolerant systems, recovery and restart are emphasized. At the other end of the spectrum, system testing and diagnosis emphasize system modeling, simulation, and analytical methods to obtain reliability estimates, such as proof of correctness and Mean Time To Failure metrics.
Between these two extremes, many software systems react to faults by taking a snapshot of all available state information at the time of the fault. In these systems, fault diagnosis is done after crash recovery by applying human intelligence to the state snapshot. Future recovery from occurrences of the same problem depends on the human analyst providing a fix for the problem which may require a new release of the software.
A common approach to fault tolerance is a checkpoint/restart mechanism with or without redundant hardware. The redundant hardware is used as a standby when the normal system fails. Test/diagnostic equipment depends on simulation and verification of some abstract model of the system. These methods are not always a practical solution for legacy systems, which are cost-sensitive and change due to market forces. These methods add cost and complexity to the system, making the system harder to debug and maintain. Furthermore, the redundant hardware adds to the overall costs of the system.
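The checkpoint/restart idea described above can be illustrated with a minimal sketch. It assumes the whole system state fits in a single in-memory dictionary; the class name CheckpointedSystem and its methods are hypothetical and chosen only for illustration, not taken from any particular system.

```python
import copy


class CheckpointedSystem:
    """Toy model of checkpoint/restart (hypothetical names, illustration only)."""

    def __init__(self, state):
        self.state = state
        # Keep an independent copy so later mutations cannot touch the checkpoint.
        self._checkpoint = copy.deepcopy(state)

    def checkpoint(self):
        # Record the current state as the last known-good state.
        self._checkpoint = copy.deepcopy(self.state)

    def restart(self):
        # Discard the (possibly corrupted) current state and roll back.
        self.state = copy.deepcopy(self._checkpoint)


system = CheckpointedSystem({"count": 0})
system.state["count"] = 5
system.checkpoint()              # known-good state recorded
system.state["count"] = -999     # simulated corruption after a fault
system.restart()                 # roll back to the last checkpoint
```

Note that the rollback restores service but records nothing about why the state became corrupted, which is the separation of recovery from diagnosis discussed in this section.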
Systems not designed for fault tolerance have tools for fault diagnosis. One such technique involves taking a snapshot of the system where the snapshot is more complete and is taken at the precise time the fault or crash occurred or is detected. This type of complete snapshot typically provides a wealth of raw system state data that can be used for pure diagnosis and is in a human readable or accessible form, normally with a debugger or crash analyzer. Human intelligence is needed to get from symptoms to root causes and, as such, is labor-intensive and is done off-line, i.e., after unrecoverable damage has been done and the system has crashed. Although the snapshot is more complete, diagnostic information is still limited to the static snapshot of the system. A dynamic response to the fault cannot be determined since the dynamic response is gratuitously altered to capture the static snapshot and to then crash and reboot the system.
When a fault occurs in a system, system state information is unreliable. This makes implementing a sophisticated fault handler problematic since it must work under conditions where correctness of operation is suspect. Fault handlers are software systems and, thus, prone to the same types of failures they are designed to handle. The problem is exacerbated by difficulty in testing the fault handler for the various scenarios it must handle. If the scenarios were known, the fault could have been avoided. Methods to handle faults must consider not only the specifics of the fault but also the context in which the fault occurs. For example, the effect of a fault in an application level process context will differ from the effect of a similar fault in an interrupt handler. It is difficult to test for all possible scenarios. Thus, there is the risk of inadequately tested software attempting to diagnose and recover from an unknown and unexpected state at a time when system operation is unreliable, making diagnosis and recovery more difficult than they would otherwise be. Consequently, it is common to keep the fault handler as simple as possible.
Another method of diagnosing a fault involves using analytical methods, an expert system, or some type of modeling and simulation. These techniques may generate test vectors which are applied to the target system to study its response or to generate measures of reliability or stability. For numerous reasons, such methods are impracticable in applications where there is a rapidly evolving code base, typically in response to market forces. Such methods, used typically in academic settings, require a very stable software code base since much time and effort must go into formulating a model, setting up a test rig, and collecting and analyzing data. These methods are off-line and are performed with reference to a model of the system and are thus limited to that model, which rapidly becomes obsolete.
FIG. 1 is a flow diagram of a generic or abstract process of handling system faults used in the techniques described above and known in the field of fault handling software systems. A system fault handler (typically a component or module in a normally operating computer system), executing concurrently with other processes during normal operation of the computer system, begins at step 102 by determining whether a fault that has occurred is one from which the system can recover. Recoverable faults are those that the system fault handler has been explicitly designed to handle. If the fault is recoverable, the system fault handler addresses the fault and returns the system to normal operation at step 106.
The emphasis here is on recovery and restart rather than diagnosis and analysis. In a checkpoint/restart system, the fault handler will use a checkpoint snapshot to return the system to a previous state, with the primary goal of simply getting the system back up and running, the goal with the highest priority in most commercial scenarios. If the fault is not recoverable, control goes to step 104, in which a current snapshot of the system is used. This static snapshot is of the system at the time the fault occurred (i.e., a snapshot of current system state) and is used to diagnose the problem off-line. The system is then brought back up by rebooting, a significant step and typically the least desirable way of resuming normal operations.
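The generic flow of FIG. 1 can be summarized in a short sketch. This is an illustration under stated assumptions, not the patented process: the names GenericFaultHandler, Outcome, and recoverable_kinds are hypothetical, faults are reduced to string labels, and the reboot is represented only by a return value.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Outcome(Enum):
    RESUMED = auto()    # step 106: fault addressed, normal operation resumes
    REBOOTED = auto()   # step 104: snapshot kept for off-line diagnosis, then reboot


@dataclass
class GenericFaultHandler:
    # Hypothetical: the fault kinds this handler was explicitly designed to handle.
    recoverable_kinds: frozenset
    # Static snapshots saved for later off-line analysis by a human.
    snapshots: list = field(default_factory=list)

    def handle(self, fault_kind: str, system_state: dict) -> Outcome:
        # Step 102: is this a fault from which the system can recover?
        if fault_kind in self.recoverable_kinds:
            # Step 106: address the fault and return to normal operation.
            return Outcome.RESUMED
        # Step 104: capture a static copy of the state at the time of the fault.
        self.snapshots.append({"fault": fault_kind, "state": dict(system_state)})
        # The system must then be rebooted (modeled here as a return value only).
        return Outcome.REBOOTED


handler = GenericFaultHandler(frozenset({"timeout"}))
first = handler.handle("timeout", {"pc": 1})      # a fault the handler anticipates
second = handler.handle("bus_error", {"pc": 2})   # a fault it was not designed for
```

In this sketch, any fault outside the anticipated set falls through to a snapshot and reboot, and no diagnosis takes place while the system is running.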
Therefore, it would be desirable to have a fault tolerant system that is capable of performing system recovery and restart and real-time diagnosis of the fault so that the same fault does not occur repeatedly. It would be desirable if the system fault handler consumed a minimal amount of resources by executing only when a fault occurs and not at all times. This also has the benefit of keeping the hardware and software less complex. In such a system, the degree of human analysis and effort spent on a current system state snapshot would be minimized since much of the diagnosis would be performed by the fault handler. It would also be desirable to be able to self-test and monitor the fault handler for various scenarios so that it can more efficiently restart the system and diagnose the fault and its context. It would be desirable for a fault handler process to permit the system to continue operation after an otherwise catastrophic failure in order to get more data on the dynamic effects of the fault or to recover from the fault.
SUMMARY OF THE INVENTION
To achieve the foregoing, methods, apparatus, and computer-readable media are disclosed for analyzing and recovering from severe faults in a computer system. In one aspect of the invention, a
Chandiramani Anil K.
Roeck Guenter E.
Beausoliel Robert
Beyer Weaver & Thomas LLP
Cisco Technology Inc.
Puente Emerson