Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability
Patent
1997-09-25
2000-02-08
Le, Dieu-Minh T.
714/11, 714/15, 709/300, 709/400, G06F 11/00
active
060237724
DESCRIPTION:
BRIEF SUMMARY
FIELD OF THE INVENTION
The present invention relates to a fault-tolerant processing method for receiving and processing input messages to produce output messages. More particularly, the present invention relates to a method of operating a software fault tolerant recovery unit where the processing of input messages is done by replicate primary and secondary application processes.
It should be noted that the term "process" is used herein in a general sense of processing functionality provided by code executing on a processor, however this code is organised (that is, whether the code is an instance of only part of a program, or of a whole program, or is spread across multiple programs). Furthermore, reference to a process as being an "application" process is intended to be understood broadly in the sense of a process providing some desired functionality, for which purpose input messages are sent to the process.
BACKGROUND OF THE INVENTION
Software-based fault-tolerant systems may be considered as organised into one or more recovery units, each of which constitutes a unit of failure and recovery. A recovery unit may be considered as made up of a live process, an arrangement for logging recovery information relevant to that live process, and recovery means which, in the event of failure of the live process, cause a replacement process to take over.
Of course, if failure of the live process due to failure of the processor running it is to be covered, then both the storage of recovery information and the recovery means itself must be separate from the processor running the live process.
Where a system comprises multiple recovery units, these will typically overlap in terms of processor utilisation; for example, the processor targeted to run the replacement process for a first recovery unit may also be the processor running the live process of a second recovery unit. In fact, there may also be common resource utilisation by the recovery units in respect of their logging and recovery means.
An illustrative prior-art fault-tolerant computer system is shown in FIG. 1 of the accompanying drawings. This system comprises three processors I, II, III and a disc unit 10, all interconnected by a LAN 11. The system is organised as two recovery units A and B, each of which has an associated live process A/L, B/L. Live process A/L runs on processor I and live process B/L runs on processor II. Recovery unit A is arranged such that upon failure of its live process A/L, a replacement process A/R will take over on processor II; similarly, recovery unit B is arranged such that should live process B/L fail, a replacement process B/R takes over on processor III.
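The arrangement described for FIG. 1 can be pictured with a small data model. The following Python sketch is illustrative only: the names RecoveryUnit, live_on, replacement_on and processes_after_failure are hypothetical and do not appear in the patent, and the sketch models only where processes run, not the logging or recovery means themselves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryUnit:
    """One unit of failure and recovery (hypothetical model, not from the patent)."""
    name: str            # e.g. "A"
    live_on: str         # processor running the live process
    replacement_on: str  # processor targeted to run the replacement process


# Layout described for FIG. 1: processors I, II, III and recovery units A, B.
UNITS = [
    RecoveryUnit("A", live_on="I", replacement_on="II"),
    RecoveryUnit("B", live_on="II", replacement_on="III"),
]


def processes_after_failure(units, failed_processor):
    """Return {processor: [process labels]} once one processor has failed.

    A unit whose live process ran on the failed processor is shown with its
    replacement process (X/R) on the target processor; every other unit keeps
    its live process (X/L) where it was.
    """
    placement = {}
    for unit in units:
        if unit.live_on == failed_processor:
            placement.setdefault(unit.replacement_on, []).append(f"{unit.name}/R")
        else:
            placement.setdefault(unit.live_on, []).append(f"{unit.name}/L")
    return placement


if __name__ == "__main__":
    print(processes_after_failure(UNITS, "I"))
    print(processes_after_failure(UNITS, "II"))
```

In this sketch, a failure of processor I leaves processor II hosting both A/R and B/L, which illustrates the overlap in processor utilisation between recovery units.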
A live process will progress through a succession of internal states depending on its deterministic behaviour and on non-deterministic events such as external inputs (including messages received from other live processes, where present) and non-deterministic internal events.
When a replacement process takes over from a failed live process, the replacement process must be placed in a state that the failed process achieved (though not necessarily its most current pre-failure state). To do this, it is necessary to have state information on the live process for at least one point prior to failure; furthermore, if information is also available on the non-deterministic events experienced by the failed process, it is possible to run the replacement process forward from the known state of the failed process to some later state achieved by that process.
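The idea of running a replacement forward from a known state can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the method claimed in the patent: the toy process simply sums integer inputs, and the names apply_event, LiveProcess, save_state and recover are hypothetical. In a real system the saved state and the event log would have to be held separately from the processor running the live process.

```python
import copy


def apply_event(state, event):
    """Deterministic transition of a toy process that sums its inputs."""
    new_state = copy.deepcopy(state)
    new_state["total"] += event
    new_state["events_seen"] += 1
    return new_state


class LiveProcess:
    """Toy live process that records what a replacement would need."""

    def __init__(self):
        self.state = {"total": 0, "events_seen": 0}
        # State known at some point prior to failure (assumed to be kept
        # somewhere that survives the failure).
        self.saved_state = copy.deepcopy(self.state)
        # Non-deterministic events experienced since saved_state was taken.
        self.event_log = []

    def handle(self, event):
        self.event_log.append(event)  # log the event before acting on it
        self.state = apply_event(self.state, event)

    def save_state(self):
        self.saved_state = copy.deepcopy(self.state)
        self.event_log.clear()  # older events are no longer needed


def recover(saved_state, event_log):
    """Start from the known state and replay the logged events in order."""
    state = copy.deepcopy(saved_state)
    for event in event_log:
        state = apply_event(state, event)
    return state


if __name__ == "__main__":
    live = LiveProcess()
    live.handle(5)
    live.save_state()
    live.handle(7)
    live.handle(-2)
    replacement_state = recover(live.saved_state, live.event_log)
    print(replacement_state == live.state)
```

Because the transition function is deterministic, replaying the same events in the same order from the saved state reproduces a state that the failed process actually reached; with an empty log the replacement simply resumes from the saved state.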
Where speed of recovery is not critical, an approach may be used where state information on the live process (process A/L in FIG. 1) is periodically checkpointed by the logging means of the recovery unit from the volatile memory of the processor running the process to stable store (disc unit 10). Upon failure of the live process A/L, the recovery means of the recovery unit can bring up a replacement process A/R in a state corresponding to the last-checkpointed state of the failed live process. Of course, unless check-po
REFERENCES:
patent: 4590554 (1986-05-01), Glazer et al.
patent: 4665520 (1987-05-01), Strom et al.
patent: 4937741 (1990-06-01), Harper et al.
patent: 5157663 (1992-10-01), Major et al.
patent: 5235700 (1993-08-01), Alaiwan et al.
patent: 5325528 (1994-06-01), Klein
patent: 5455932 (1995-10-01), Major et al.
patent: 5530802 (1996-06-01), Fuchs et al.
patent: 5555371 (1996-09-01), Duyanovich et al.
patent: 5590277 (1996-12-01), Fuchs et al.
patent: 5835953 (1998-11-01), Ohran
"A Principle for Resilient Sharing of Distributed Resources" by Peter A. Alsberg and John D. Day, Proceedings 2nd International Conference on Software Engineering, San Francisco, CA, Oct. 13-15, 1976, pp. 562-570.
Fault Tolerance Under UNIX pp. 1-24.
Using Passive Replicates in Delta-4 To Provide Dependable Distributed Computing pp. 184-190.
Hewlett-Packard Company