Apparatus and method for reducing duration of timeout...

Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability

Reexamination Certificate

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

C714S797000

Reexamination Certificate

active

06363496

ABSTRACT:

BACKGROUND OF THE INVENTION
The present invention relates to a fault-tolerant distributed computer system, and, in particular, to a distributed computer system that relies upon timeout to detect if a component of the system has failed. That is, when the system waits more than a certain period of time for an action to occur, it declares something is wrong. The present invention makes use of statistical methods to improve the system's timeliness by reducing the duration of the waiting period prior to the system issuing a timeout.
Distributed computer systems consist of individual computers interconnected by a network. The individual computers cooperate in executing a common program by exchanging messages over the network. The individual computers are called “the computing nodes” or simply “the nodes”. The distributed computer system itself is called “a distributed system”. In a fault-tolerant distributed system the distributed program continues to execute its instructions even if some nodes fail.
A node may fail by simply ceasing to execute instructions. Since it does not send spurious or corrupted information over the network, other nodes cannot sense the failure. The failed node merely remains silent by not responding to messages sent to it by other nodes. No response to a message shows the sending nodes that the receiving node has failed. However, even nodes that have not failed do not respond instantaneously. Transmission delays in the network account for some lack of responsiveness. Delays can be further compounded if messages are service requests that the receiver satisfies with its response.
Since absence of a response is insufficient for deciding that the receiving node has failed, the sending node has to set a limit on the time it will wait. This wait is called a timeout period. If the timeout period passes without the anticipated action, then a timeout has occurred.
Deciding how long the timeout period should be is critical to the overall operation of the distributed system. Too short a timeout period, when operating nodes prolong their responses merely because of a heavier-than-normal workload, can cause the system to regard a node as failed. Too long a timeout period allows a failed node to suspend system operation until the timeout occurs.
Watchdog timers, as described by Johnson (
Design and Analysis of Fault
-
Tolerant Digital Systems
, Addison Wesley, 1989, Pages 68-69) and Lee and Anderson (
Fault Tolerance Principles and Practice
, 2
nd
Edition, Springer-Verlag, 1990, Pages 104-105), enable a timeout. To detect a lack of response, timing checks are imposed on tasks at mandatory intervals during execution. Prior to the end of each interval, receipt of an “I am alive” message is expected. The watchdog timer is set to a value that corresponds to the expected execution time until the next “I am alive” message. There has to be leeway to compensate for slight variations in execution time within an interval. But it is easier to estimate the expected variation in several small intervals during the task than to estimate when the entire task should be completed. As more time passes prior to a timing check, the longer will a processor executing a task be exposed to factors that cause it to drift from what might be “expected.” (Note that expectation here is not the mathematical expectation that corresponds to the mean of a statistical distribution. Instead its connotation is what the system designer believes is reasonable, based on the task's demands for algorithms and resources.)
Watchdog timers ease the constraints on limiting a waiting period. They do so by substituting several small problems, each of which is easier to solve, for the whole problem. But watchdog timers have drawbacks. Sending an “I am alive” message may be difficult unless software was specifically written to support watchdog timers. Even if the software does support watchdog timers, sending a message requires that a task suspend execution, thereby reducing system throughput. In distributed systems, frequent “I am alive” messages create traffic that causes network congestion. System performance suffers, thereby compounding the problem of estimating the shortest practical timeout period.
The prior-art method of watchdog timers thus does not succeed in limiting the timeout period in fault-tolerant distributed systems. One alternative is to permit unbounded message delays, forego automating the calculation of an optimal timeout, and let a human operator detect the lack of system response. However, the comparatively slow reaction of a human eliminates this alternative from consideration for all applications but those few where an operator is present and a rapid response is not required.
Unfortunately the cases are rare where one can accept unbounded message delays. So we require a fault-tolerant distributed system that ensures the existence of an upper bound d on the timeout period for nodes that work properly. See Cristian, “Understanding Fault-Tolerant Distributed Systems,” 34
Communications of the ACM
(February 1991), the disclosure of which is incorporated herein by reference.
To mitigate some unwanted side effects of watchdog timers, designers can reduce the number of timing checks by increasing the intervals between them. Considering only the completion time for the entire task can eliminate intermediate checks. In either case, designers must choose a limit to the length of a timeout period. A timeout's duration is thus based on a designer's assumptions about the real-time behavior of the operating system, the maximum system load, and the application of massive redundancy to the communication hardware. The designer tries to ensure a negligible chance that network partitions will occur as described by Cristian (Ibid.). To avoid making a timeout period too short, it must be based on worst-case scenarios. Even though the worst case may be most unlikely, the prior art treats a conservative approach based on a worst case as superior to risking the inadvertent loss of operating nodes through premature timeout.
Timeout is important for working nodes to detect a node that fails by omitting a response. However, a failed node may send a timely response whose data is corrupted. To cope with such a failure, it is necessary to replicate a particular task on several nodes at the same time. By creating more than a single instance of the task, discrepancies can be detected by comparing the outputs from every node that executes the task. When three or more nodes execute the same task, the correct output is presumed to be the result of a “vote” among them. That is, each node offers its own solution to the task, and the system brings all the solutions together to decide which commands a majority. Each node that executes a redundant task communicates its results to a voter that collates all the results. Making the voter's output represent the majority result from the nodes masks an erroneous result. More than half the results must agree to form a majority.
The voter may be a specially designated node, or it may be distributed among the nodes. At the start of the task, nodes are synchronized so that voting takes place when the task is completed. A centralized voter that has independent connections to the nodes can collect their results in parallel. When the voter is distributed, each node broadcasts its results to the other nodes, so that each node determines the majority.
Timing is critical in voting. If results arrive at the voter at slightly different times, incorrect results can be generated temporarily. In many applications, an incorrect result cannot be allowed for even a very small period. Furthermore, if some of the initial results arrive at a voter simultaneously, a remaining node may be incorrectly declared faulty because its results arrived at the voter after voting took place. For these reasons, it is important to synchronize voting.
Some voting schemes permit unsynchronized nodes. The unsynchronized inputs are first marshaled and then fed simultaneously to the voter so that they a

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

Apparatus and method for reducing duration of timeout... does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with Apparatus and method for reducing duration of timeout..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Apparatus and method for reducing duration of timeout... will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-2828449

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.