Heartbeat failure detector method and apparatus

Electrical computers and digital processing systems: multicomput – Computer-to-computer data routing – Prioritized data routing

Reexamination Certificate

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

C709S201000, C709S223000, C370S216000, C714S004110, C714S038110

Reexamination Certificate

active

06728781

ABSTRACT:

BACKGROUND
As the use of computer networks in safety-critical systems increases (e.g., fly-by-wire commercial aircraft systems, automated train and subway operation systems, airline traffic control systems, medical systems, banking and stock market systems, etc.) the demand for fault-tolerant and highly-available distributed systems also rapidly increases. A crucial component of any such system is reliable communication. As such, there is a urgent need for a method and apparatus that can monitor network systems with process crashes, message losses, and network partitioning. Furthermore, such a method or apparatus could also be used to provide efficient solutions of several other core components of fault-tolerant distributed systems, such as Consensus, Atomic Broadcast, and Atomic Commitment.
Computer networks are often subject to failure. Typically these failures include crashes of processors and processes, message losses, and network partitioning. With such failures, achieving reliable communication is difficult and expensive. The following simple example illustrates the basic problem.
Suppose process x sends a message m to process y, but x does not receive the acknowledgment ack(m) indicating that y received m. Should x keep on trying to send copies of m to y, or should it give up? The answer depends on why x is not receiving ack(m) from y. It could be either because (a) y has crashed or is partitioned away from x, or because (b) the copies of m or ack(m) sent so far were lost due to transient link failures. In the first case x should not continue resending m (because it is useless and thus a waste of network resources), and in the second case, x should try to resend m again. The problem is that x cannot distinguish between the two cases.
To deal with this problem, existing communication protocols either use timeouts or limit the number of allowed retransmissions. In the first scheme, x keeps on resending copies of m until its timeout period expires, and in the second scheme, it resends m until it reaches the maximum allowed number of retransmissions. If processor x gets no acknowledgments by then, it gives up sending m to y. This type of solution is problematic for two reasons. First, the selection of an adequate timeout period, or of the maximum number of retransmissions, is difficult because it depends on many system parameters that may change over time. Second, x sends several copies of m toy even if y is actually crashed (or partitioned away from x)—a waste of network bandwidth.
SUMMARY
The Heartbeat Failure Detector is a new, simple method and device that solves the above problem. This device has several identical modules, and each module is attached to a processor in the system. Roughly speaking, the module of a processor x maintains a heartbeat counter for every other process. For each process y, the counter of y at x (called the heartbeat of y at x) periodically increases while y is alive and in the same network partition, and the counter stops increasing after y crashes or becomes partitioned away. Using such a device, x can solve the communication problem above by resending m only if the heartbeat of y at x increases. Note that if y is crashed (or partitioned away from x), its heartbeat at x stops almost immediately, and so x will also immediately stop sending copies of m to y.
An important feature of the Heartbeat Failure Detector is that it can easily be implemented without using timeouts. The detailed specification shows how to implement the Heartbeat Failure Detector in several types of networks where processes may crash and links may lose messages. In particular, we consider networks where all communication links are bidirectional, networks where some links are unidirectional, and networks that may partition. The basic ideas behind these implementations are as follows. For networks that do not partition, processors (processes) periodically broadcast “heartbeat” messages: whenever a processor x receives a heartbeat message from a processor y, it increments the heartbeat counter of y at x. For networks that may partition, x increments the heartbeat counter of y each time x receives a heartbeat message that was sent by x and later relayed by y.
Note that the Heartbeat Failure Detector does not use timeouts on the heartbeats of a process in order to determine whether this process has failed or not. The Heartbeat Failure Detector just counts the total number of heartbeats received from each process, and outputs these “raw” counters without any further processing or interpretation. Thus, the Heartbeat Failure Detector should not be confused with existing implementations of failure detectors (some of which, such as those in Ensemble and Phoenix, have modules that are also called heartbeat). Even though existing failure detectors are also based on the repeated sending of a periodic message, they use timeouts on these messages in order to derive lists of processes considered to be up or down; applications can only see these lists. In contrast, the Heartbeat Failure Detector simply counts heartbeats, and shows these counts to applications.
The Heartbeat Detector works with partitionable networks with process crashes and lossy links, and is applicable to the problems of reliable communication and consensus for such networks. For both problems we developed algorithms that are quiescent, i.e., algorithms that eventually stop sending messages. We first tackle the problem of reliable communication for partitionable networks by extending the results of simple point-to-point solutions. In particular, we generalize the specification of the heartbeat failure detector (hereinafter “HB”), show how to implement it, and show how to use it to achieve quiescent reliable communication. We also show how to use the HB for consensus for partitionable networks. We first show that, even though this problem can be solved using a natural extension of failure detector~S, such solutions are not quiescent—in other words, S alone is not sufficient to achieve a quiescent consensus in partitionable networks. We then solve this problem using~S and the quiescent reliable communication primitives that we developed for the HB.
For those skilled in the art who desire further proof of the efficacy of the invention, we provide an Appendix of lemma showing how the invention necessarily performs its disclosed operations.


REFERENCES:
patent: 5390326 (1995-02-01), Shah
patent: 6324161 (2001-11-01), Kirch
Robbert van Renesse et al., “A Gossip-Style Failure Detection Service”, Dept. Comp. Science, Cornell University, pp. 1-16, 1996.

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

Heartbeat failure detector method and apparatus does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with Heartbeat failure detector method and apparatus, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Heartbeat failure detector method and apparatus will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-3266927

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.