Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability
Reexamination Certificate
2000-10-13
2004-08-03
Iqbal, Nadeem (Department: 2184)
C714S010000
active
06772367
ABSTRACT:
FIELD OF THE INVENTION
The present invention is generally directed to methods for correcting synchronization faults in concurrently executed computer programs and, more particularly, to methods and systems for fault tolerance of concurrently executed software programs using controlled re-execution of the programs.
BACKGROUND OF THE INVENTION
Concurrent programs are difficult to write. The programmer is presented with the task of balancing two competing forces: safety and liveness. Frequently, the programmer leans too much in one of the two directions, causing either safety failures (e.g., races) or liveness failures (e.g., deadlocks). Such failures arise from a particular kind of software fault (bug), known as a synchronization fault. Studies have shown that synchronization faults account for a sizeable fraction of observed software faults in concurrent programs. Locating synchronization faults and eliminating them by reprogramming is always the best strategy. However, many systems must maintain availability in spite of software failures. Concurrent programs include all parallel programming paradigms such as multi-threaded programs, shared-memory parallel programs, message-passing distributed programs, distributed shared-memory programs, etc. A parallel entity may be referred to as a process, although in practice it may also be a thread.
Traditionally, it was believed that software failures are permanent in nature and, therefore, that they would recur in every execution of the program with the same inputs. This belief led to the use of design diversity to recover from software failures. In approaches based on design diversity, redundant modules with different designs are used, ensuring that there is no single point of failure. Contrary to this belief, it was observed that many software failures are, in fact, transient (they may not recur when the program is re-executed with the same inputs). In particular, the failures caused by synchronization faults are usually transient in nature.
The existence of transient software failures motivated a new approach to software fault tolerance based on rolling back the processes to a previous state and then restarting them (possibly with message reordering), in the hope that the transient failure will not recur in the new execution. Methods based on this approach have mostly relied on chance in order to recover from a transient software failure. In the special case of synchronization faults, however, it is desirable to do better.
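The chance-based rollback approach described above can be illustrated with a toy model. The following is a minimal sketch only, not taken from the patent: the two simulated processes, the lost-update check used as the failure detector, and the names run and recover_by_retry are all hypothetical. Each process reads a shared counter and then writes back the value plus one, with no lock (the synchronization fault). Recovery simply restarts with a fresh random interleaving until the failure happens not to recur.

```python
import random
from typing import List, Optional, Tuple


def run(schedule: List[int]) -> int:
    """Simulate two hypothetical processes that each read, then write counter + 1.

    `schedule` lists process ids in the order their steps execute; each id
    must appear exactly twice (first its read, then its write).
    """
    counter = 0
    local = [0, 0]
    has_read = [False, False]
    for pid in schedule:
        if not has_read[pid]:
            local[pid] = counter        # read step
            has_read[pid] = True
        else:
            counter = local[pid] + 1    # write step (may overwrite an update)
    return counter


def recover_by_retry(rng: random.Random,
                     max_attempts: int = 1000) -> Optional[Tuple[int, List[int]]]:
    """Roll back and restart with uncontrolled interleavings until one succeeds."""
    schedule = [0, 1, 0, 1]             # the original, failing interleaving
    for attempt in range(1, max_attempts + 1):
        if run(schedule) == 2:          # assumed failure check: no lost update
            return attempt, schedule
        rng.shuffle(schedule)           # restart; the new order is left to chance
    return None


result = recover_by_retry(random.Random(0))
```

In this model only two of the six possible interleavings avoid the lost update, so each restart succeeds with probability one third. Nothing in the retry loop uses what was learned from the failed run, which is the limitation the paragraph above points to.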
It would therefore be desirable to be able to bypass a synchronization fault and recover from the resulting failure.
SUMMARY OF THE INVENTION
The present invention controls the re-execution of concurrent programs in order to avoid a recurrence of the synchronization failure. The invention provides a method of (i) tracing an execution, (ii) detecting a synchronization failure, (iii) determining a control strategy, and (iv) re-executing under control.
Control is achieved by tracing information during an execution and using this information to add synchronizations during the re-execution.
In accordance with the present invention, a method of providing fault tolerance in concurrently executing computer programs by controlling the re-execution of concurrent programs in order to avoid a recurrence of synchronization failures is provided, comprising:
(a) tracing the execution of concurrent programs;
(b) detecting synchronization failures resulting from said execution of the concurrent programs; and
(c) applying a control strategy, based on said detection of failures, for said execution of the concurrent programs.
Also in accordance with the present invention, application of a control strategy includes causing a re-execution of said concurrent programs under a control derived from tracing information during an execution, and wherein said control includes using said information to add synchronizations to said concurrent programs during re-execution.
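Steps (a) through (c) and the controlled re-execution can be sketched on a toy model. This is an illustrative sketch under stated assumptions, not the patented algorithm: the simulated scheduler, the event names, the overlap-based failure detector, and the functions execute, detect_failure, and derive_control are all hypothetical. Two simulated processes enter an unprotected critical section; the trace of the failed run is used to derive one added synchronization, which the re-execution then enforces.

```python
from typing import Iterable, List, Optional, Set, Tuple

Event = Tuple[int, str]       # (process id, step name)
Sync = Tuple[Event, Event]    # (a, b): event b may not run until a has run

# Hypothetical process body: a critical section around a non-atomic increment.
PROGRAM = ["enter", "read", "write", "exit"]


def execute(schedule: List[int],
            control: Iterable[Sync] = ()) -> Tuple[List[Event], int]:
    """(a) Run two processes in the given interleaving, tracing every event."""
    control = list(control)
    pc, local, counter = [0, 0], [0, 0], 0
    trace: List[Event] = []
    pending = list(schedule)  # each process id must appear len(PROGRAM) times
    while pending:
        for i, pid in enumerate(pending):
            event = (pid, PROGRAM[pc[pid]])
            # An added synchronization delays `event` until its predecessor ran.
            if not any(b == event and a not in trace for a, b in control):
                break
        else:
            raise RuntimeError("every remaining process is blocked")
        pending.pop(i)
        if event[1] == "read":
            local[pid] = counter
        elif event[1] == "write":
            counter = local[pid] + 1
        trace.append(event)
        pc[pid] += 1
    return trace, counter


def detect_failure(trace: List[Event]) -> Optional[Tuple[int, int]]:
    """(b) Report (holder, intruder) if two critical sections overlap."""
    inside: Set[int] = set()
    for pid, step in trace:
        if step == "enter":
            if inside:
                return next(iter(inside)), pid
            inside.add(pid)
        elif step == "exit":
            inside.discard(pid)
    return None


def derive_control(conflict: Tuple[int, int]) -> Set[Sync]:
    """(c) Add one synchronization: the holder must exit before the intruder enters."""
    holder, intruder = conflict
    return {((holder, "exit"), (intruder, "enter"))}


schedule = [0, 1, 0, 1, 0, 1, 0, 1]
trace, counter = execute(schedule)                 # failing run: counter is 1
conflict = detect_failure(trace)                   # (0, 1)
control = derive_control(conflict)
retrace, recounter = execute(schedule, control)    # controlled run: counter is 2
```

The re-execution uses the same inputs and the same requested schedule. Only the added synchronization differs, so process 1 is held back until process 0 has left its critical section and the failure does not recur.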
REFERENCES:
patent: 4358823 (1982-11-01), McDonald et al.
patent: 5016249 (1991-05-01), Hurst et al.
patent: 5249187 (1993-09-01), Bruckert et al.
patent: 5423024 (1995-06-01), Cheung
patent: 5440726 (1995-08-01), Fuchs et al.
patent: 5530802 (1996-06-01), Fuchs et al.
patent: 5590277 (1996-12-01), Fuchs et al.
patent: 6038684 (2000-03-01), Liddell et al.
patent: 6058491 (2000-05-01), Bossen et al.
patent: 6161196 (2000-12-01), Tsai
patent: 6173414 (2001-01-01), Zumkehr et al.
Garg Vijay K.
Tarafdar Ashis
Board of Regents, The University of Texas System
Bonura Timothy M
Gardere Wynne & Sewell LLP
Iqbal Nadeem