Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability
Patent
1998-04-03
1999-08-17
Beausoliel, Jr., Robert W.
714/16, G06F 11/14
active
059387750
ABSTRACT:
A fault tolerant message passing system includes a plurality of interconnected processors with storage and a watchdog process wherein the processors may undergo failure. A method restores a consistent system state using optimistic logging protocol with asynchronous recovery. Each process comprises a sequence of state intervals and includes checkpoints for saving in storage the state of the process sufficient to re-start execution of the process. Non-deterministic event messages are logged in storage by each process for replay after process re-start to reconstruct pre-failure state intervals. Transitive dependency tracking of messages and process states is performed to record the highest-index state interval of each process upon which a local process depends. A variable size dependency vector is attached to each outgoing message sent between processes. An integer K is assigned to each outgoing message as the upper bound on the vector size. The vector for the local process is updated upon receiving each incoming message. A process failure is detected and the failed process is re-started. The latest checkpoint is restored and the logged messages are replayed. A new incarnation of the failed process is started and identified by P_{i,t} where (i) is the process number and (t) is the incarnation number, each state interval being identified by (t,x)_i where (x) is the state interval number. A failure announcement is broadcast to the other processes, the announcement containing (t,x)_i where (x) is the state interval number of the last recreatable state interval of the failed process incarnation P_{i,t}. Upon receiving a failure announcement containing (t,x)_i, the entry for process (i) is extracted from the local dependency vector. The entry for process (i) is compared to the (t,x)_i contained in the failure announcement. The process is classified as orphaned from the comparison if the process depends upon a higher-index state interval than (t,x)_i. A process roll-back is performed to reconstruct only non-orphaned state intervals in the rolled-back process.
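The dependency tracking and orphan test described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Process` class, its method names, and the dictionary representation of the dependency vector are hypothetical, the bound K on the vector size is left out, and the orphan test assumes that "higher-index" is judged within the announced incarnation t only.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# An entry (t, x): incarnation number t and state interval number x.
Entry = Tuple[int, int]


@dataclass
class Process:
    """Hypothetical process that tracks transitive dependencies."""
    pid: int
    incarnation: int = 0
    interval: int = 0
    # Dependency vector: process id -> highest-index state interval
    # of that process on which the local state depends.
    deps: Dict[int, Entry] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # A process always depends on its own current state interval.
        self.deps[self.pid] = (self.incarnation, self.interval)

    def send(self) -> Dict[int, Entry]:
        """Return a copy of the vector to attach to an outgoing message."""
        return dict(self.deps)

    def receive(self, attached: Dict[int, Entry]) -> None:
        """Start a new state interval and merge the sender's vector."""
        self.interval += 1
        for pid, entry in attached.items():
            # Tuples compare lexicographically: incarnation, then interval.
            if pid not in self.deps or entry > self.deps[pid]:
                self.deps[pid] = entry
        # The local entry always names the interval just started.
        self.deps[self.pid] = (self.incarnation, self.interval)

    def is_orphaned_by(self, failed_pid: int, announcement: Entry) -> bool:
        """Compare the local entry for failed_pid with an announced (t, x).

        Assumption: only a dependency on the announced incarnation t with
        an interval number above x counts as orphaned.
        """
        entry = self.deps.get(failed_pid)
        if entry is None:
            return False  # no dependency on the failed process at all
        dep_t, dep_x = entry
        t, x = announcement
        return dep_t == t and dep_x > x
```

For example, if process 0 receives a message from process 1 and process 1 then receives a reply from process 0, process 1 depends on interval (0,1) of process 0. An announcement (0,0) for process 0 makes `is_orphaned_by(0, (0, 0))` true for process 1, while an announcement (0,1) does not.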
REFERENCES:
patent: 4570261 (1986-02-01), Maher
patent: 4665520 (1987-05-01), Strom et al.
patent: 5396613 (1995-03-01), Hollaar
patent: 5440726 (1995-08-01), Fuchs et al.
patent: 5485608 (1996-01-01), Lomet et al.
patent: 5530802 (1996-06-01), Fuchs et al.
patent: 5590277 (1996-12-01), Fuchs et al.
patent: 5630047 (1997-05-01), Wang
Wang et al., "Optimistic Message Logging for Independent Checkpointing in Message-Passing Systems", IEEE, pp. 147-154, 1992.
ACM Transactions on Computer Systems, vol. 7, No. 1, Feb. 1989, pp. 1-24, by Anita Borg et al., "Fault Tolerance Under UNIX".
Proceedings of 16th Int'l Conference on Distributed Computing Systems, May 27-30, 1996, IEEE Computer Society Press, pp. 108-115, O. P. Damani et al., "How to Recover Efficiently and Asynchronously when Optimism Fails".
School of Computer Science, Carnegie Mellon Univ., (CMU-CS-96-181), Pittsburgh, PA. & ACM Computing Surveys; E. N. Elnozahy et al.; pp. 1-46; "A Survey of Rollback-Recovery Protocols in Message-Passing Systems".
IEEE 24th Int'l Symposium on Fault-Tolerant Computing, Jun. 15-17, 1994; E. N. Elnozahy & W. Zwaenepoel; pp. 298-307; "On the Use and Implementation of Message Logging".
IEEE 25th Int'l Symposium on Fault-Tolerant Computing, Jun. 27-30, 1995; Y. Huang & Yi-Min Wang; pp. 459-463; "Why Optimistic Message Logging Has Not Been Used In Telecommunications Systems".
In Proceedings 12th Symposium on Reliable Distributed Systems, Oct. 6-8, 1993; David B. Johnson; pp. 86-95; "Efficient Transparent Optimistic Rollback Recovery for Distributed Application Programs".
J. of Algorithms, vol. 11, No. 3, Sep. 1990; D. B. Johnson & W. Zwaenepoel; pp. 462-491; "Recovery in Distributed Systems Using Optimistic Message Logging and Checkpointing".
Com. of the Assn. for Computing Machinery, vol. 21, No. 7, Jul. 1978; Leslie Lamport; pp. 558-565; "Time, Clocks, and the Ordering of Events in a Distributed System".
IEEE Proc. 10th Symposium on Reliable Distributed Systems, Sep. 30 to Oct. 2, 1991; A. Lowry et al.; pp. 66-75; "Optimistic Failure Recovery for Very Large Networks".
Proc. of 8th Annual ACM Symposium on Principles of Distributed Computing, Aug. 14-16, 1989; A. P. Sistla & J. L. Welch; pp. 223-238; "Efficient Distributed Recovery Using Message Logging".
IEEE Proc. 25th Int'l Symposium on Fault-Tolerant Computing, Jun. 27-30, 1995; Sean W. Smith et al.; pp. 361-370; "Completely Asynchronous Optimistic Recovery with Minimal Rollbacks".
ACM Transactions on Computer Systems, vol. 3, No. 3, Aug. 1985; R. E. Strom and S. Yemini; pp. 204-226; "Optimistic Recovery in Distributed Systems".
IEEE Proc. 12th Symposium on Reliable Distributed Systems, Oct. 6-8, 1993; pp. 78-85; Yi-Min Wang & W. Kent Fuchs; "Lazy Checkpoint Coordination for Bounding Rollback Propagation".
Damani Om P.
Garg Vijay Kumar
Wang Yi-Min
AT&T Corp.
Baderman Scott T.
Beausoliel, Jr. Robert W.
Distributed recovery with κ-optimistic logging