Failure detector with consensus protocol

Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability

Reexamination Certificate

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

C714S047300, C714S023000, C714S015000

Reexamination Certificate

active

06687847

ABSTRACT:

FIELD OF INVENTION
This invention relates to interconnected computer systems, and more specifically to interconnected fault-tolerant computer systems which perform distributed processing.
BACKGROUND OF INVENTION
Distributed computing is the design, programming and operation of a system of interconnected computers to achieve some common goal. In a stricter sense, distributed computing allows the processes performed by a given computing platform or set of platforms in the system to be performed in one or more other platforms in a coordinated fashion without affecting the computational results obtained by the processes. Fault-tolerant distributed computing provides for such a system to compensate reliably, accurately and in a timely way, and without manual intervention, for any of a specified range of failures which may occur during the pursuit of that goal.
The successful management of faults, errors and failures in such a computing environment is complex, but it is essential for any application requiring the cooperation of multiple computers on a real-time basis, especially where the application and the system must support or protect human life. Such situations include medical diagnostic and life-support systems, aircraft fly-by-wire systems, banking, finance and stock-trading systems, and spacecraft environment control and repair systems.
Most computer applications require only a single processor and memory, storage space connected exclusively to that processor, and a link to the Internet, with only one user in control. For such systems, no fault-tolerance exists; if the system encounters a hardware fault or a software error, it will either fail the faulty component, terminate the software in error, or crash the system. As any user of a desktop computer will attest, recovery is a manual affair, and may take a long time.
By contrast, fault-tolerant systems must continue acceptable operation under failure conditions. Fault-tolerant systems require some level of hardware redundancy, keeping reserve hardware and software processing components available for use in sufficient numbers and types to handle ongoing and anticipated workload whenever an operating component fails. Since a processing component may fail, multiple processing components are designed into fault-tolerant systems to allow uninterrupted completion of work in progress. The processing components cooperate in the distribution, execution and completion of tasks, and the assembly and distribution of their results.
Failure of a processing component in a fault-tolerant distributed system does not imply that the failed component stops all processing and relinquishes its workload. Some types of failure permit reboot and recovery of the failed component, allowing it to continue with active service. Under conditions of heavy load, this reactivation of a failed component may be essential for the system to continue to deliver its outputs as required.
Unfortunately, such reactivations may not correct the problem causing the failure in the first place, and a reactivated processing component may resume its work, deliver some results, and fail again, more than once. How does the system accommodate such erratic behavior? How can the system make a valid determination as to the state of its processing components, and act accordingly to complete its work in an acceptable manner?
The same question applies whether the processing components in question are hardware components, software components, or a combination or blend of the two types. For the purposes of this discussion, a process and a processor are considered to have similar, even identical, problems of behavior. The terms “process” and “processor” are used interchangeably here.
Numerous mechanisms have been constructed in hardware and software to prevent individual component failures from stopping an entire system or rendering its processing ineffective. A critical problem with many such mechanisms is that they cannot reliably detect and identify failures of other system components. A failure message may be lost. A failure-detection component may itself malfunction. A working component may be falsely identified by malfunctioning hardware or software as one that has failed.
Consensus
The best range of solutions to the problem of reliable failure detection falls under the heading of consensus: getting the active, reliable processes in a system to agree on some common decision value, such as whether to commit or abort a transacction, whether or not a given component of the system has failed. Depending on the degree of complexity of failures the system is designed to handle, such consensus may require a high level of redundancy of processes. This requirement adds significant capital and operating cost to the system.
The problem of solving consensus in asynchronous systems with unreliable failure detectors (i.e., failure detectors that make mistakes) was first investigated in [CT96, CHT96]. These works only considered systems where process crashes are permanent and links are reliable (i.e., they do not lose messages). In real systems, however, processes may recover after crashing and links may lose messages. The problem of solving consensus with failure detectors in such systems was first considered in [DFKM96, OGS97, HMR97].
Solving consensus in a system where processes may recover after crashing raises two new problems; one regards the need for stable storage and the other concerns the failure detection requirements.
First, regarding stable storage: when a process crashes, it loses all its local state—the memory of what is going on at the time of the crash. This “memory loss” limits severely the actions a process can take upon its recovery. One way proposed for dealing with this problem is to assume that parts of the local state are recorded into stable storage, and can be restored after each recovery. But since stable storage operations are slow and expensive, they must be avoided as much as possible. Is stable storage always necessary when solving consensus? If not, under which condition(s) can it be completely avoided?
Second, regarding failure detection: in the crash-recovery model, a process may keep on crashing and recovering indefinitely. Such a process is called unstable. How should a failure detector view unstable processes? An unstable process may be as useless to an application as one that permanently crashes, and may in fact be disruptive to overall system operation. For example, an unstable process can be up just long enough to be considered operational by the failure detector, and then crash before “helping” the application; this up-down-up cycle could repeat indefinitely. It would be natural to require that a failure detector satisfy the following completeness property: Eventually every unstable process is permanently suspected.
Implementing such a failure detector is difficult even in a perfectly synchronous system—one in which the stages of operation are synchronized and timed, and expectations of completion may be set and met. The difficulty is due to the fact that, at any given point in time, no such implementation can predict the future behavior of a process that has crashed in the past but is currently “up”. Will this crashed process continue to repeatedly crash and recover, or will it stop crashing?
The problem of solving consensus with failure detectors in systems where processes may recover from crashes was first addressed in [DFKM96], with crash recovery as a form of omission failure. More recently the problem was studied in [OGS97, HMR97]. In these three works, the question of whether stable storage is always necessary was not addressed, and all the algorithms used stable storage. In [DFKM96, OGS97] the entire state of the algorithm is recorded into stable storage at every state transition. In [HMR97], only a small part of the state is recorded, and writing to stable storage is done at most once per round. The algorithm in [DFKM96] is not designed to

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

Failure detector with consensus protocol does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with Failure detector with consensus protocol, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Failure detector with consensus protocol will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-3334067

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.