Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability
Reexamination Certificate
1998-12-10
2001-11-20
Ray, Gopal C. (Department: 2181)
C710S107000
active
06321344
ABSTRACT:
TECHNICAL FIELD
This invention relates to reliable distributed processing systems, and more specifically, to systems having means for detecting the failure of a controlling computer.
Problem
When the data processing load of a system is large enough to require more than one processor, a distributed processing system is frequently used. Such a system not only allows more processing to be performed, but also offers higher reliability, provided enough processors are available that the failure of one processor still leaves the system with enough processing power to handle the processing load.
In such systems, it is usually necessary to have one lead processor that is responsible for assigning global resources, a role which cannot be usefully performed by two processors simultaneously. In a distributed processing system, it is necessary to detect the failure of any processor, but it is especially important to detect the failure of the lead processor and to reassign its role to another processor, since the lead processor allocates global resources, such as space on a shared disk memory, and controls which processors are actually performing specific data processing functions.
Two approaches have been used to identify problems in the lead processor. In the Reliable Computing Complex (RCC) system manufactured by Lucent Technologies Inc., a special small processor, called a “Watchdog”, continuously monitors the performance of the lead processor and the other processors. It does so by verifying that their state information matches the state the Watchdog expects, and by testing that each of them generates a “heart-beat” signal representing the successful performance of basic operating functions. The Watchdog itself is designed to be especially reliable.
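The two checks described above, a state comparison and a heart-beat test, can be sketched as follows. This is only an illustrative sketch and not the RCC implementation: the `ProcessorStatus` record, the `find_suspects` function, and the 2-second timeout are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class ProcessorStatus:
    """Hypothetical record a Watchdog might keep for one processor."""
    name: str
    expected_state: str      # state the Watchdog expects, e.g. "active"
    reported_state: str      # state the processor last reported
    last_heartbeat: float    # time of the most recent heart-beat, in seconds


def find_suspects(statuses, now, heartbeat_timeout=2.0):
    """Return the names of processors whose heart-beat is overdue or whose
    reported state differs from the state the Watchdog expects."""
    suspects = []
    for s in statuses:
        heartbeat_overdue = (now - s.last_heartbeat) > heartbeat_timeout
        state_mismatch = s.reported_state != s.expected_state
        if heartbeat_overdue or state_mismatch:
            suspects.append(s.name)
    return suspects


# Assumed sample data: "p1" has stopped sending heart-beats and
# "p2" reports a state the Watchdog does not expect.
statuses = [
    ProcessorStatus("lead", "active", "active", last_heartbeat=9.5),
    ProcessorStatus("p1", "active", "active", last_heartbeat=6.0),
    ProcessorStatus("p2", "standby", "active", last_heartbeat=9.8),
]
print(find_suspects(statuses, now=10.0))
```

A processor flagged here is only a suspect. What the Watchdog does next (further tests, switching the processor off-line) is outside the scope of this sketch.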
In many other systems, there is a single processor which ends up being the lead processor. In RCC, that is the Watchdog hardware. For those systems using a lead processor, usually a personal computer (PC), some mechanism is used to select what is intended to be a single PC or computer to perform that function. Once a lead PC is chosen, the role does not move until there is some failure. In existing commercial systems, intermittent failures and other rare occurrences can sometimes cause a second PC to become a lead as well, and having two leads at once causes a lot of trouble.
The Microsoft Cluster Solution, offered by the Microsoft Corporation, approaches this problem somewhat differently. It attempts to replicate all data simultaneously on all machines. Complicated algorithms are used to ensure that this happens correctly. Different complicated algorithms are used to determine which computer is entitled to obtain a shared device when more than one computer wants it. Neither of these algorithms is perfect. There is no issue about verifying that all data is replicated correctly on all machines, since there is a single place where the “golden” data lives.
Another approach to achieving reliable distributed processing systems is to assign the lead processor role to any one of the processors, and to have that lead processor perform the Watchdog role, i.e., the role of ensuring that each of the other processors is still in satisfactory operating condition. A problem arises in such systems if, for some reason, two processors are simultaneously set to a state in which they perform the Watchdog role. Further, under such circumstances, the arrangements for detecting faulty processors and switching them off-line tend to be very unreliable.
Solution
Applicant has overcome these problems, and has made a contribution over the prior art, in an arrangement wherein the Watchdog role, which is carried by a token called a Watchdog object, travels periodically from processor to processor. In addition, the next processor to act as a Watchdog processor is initialized with Watchdog data via a “Ghost Object”. If the next processor does not receive the signal and data to become the new Watchdog, it will automatically seize the role of Watchdog and, in doing so, send a next Watchdog indicator (“Ghost Object”) to the next processor. A processor which receives a next Watchdog indicator carrying a further indication that the predecessor processor did not pass on a Watchdog Token will initiate tests of the processor that failed to send on this Watchdog Token. If the results indicate that a processor is faulty, it is switched out of the loop of processors performing the Watchdog function. Advantageously, this arrangement allows for a highly reliable assignment of the Watchdog role, and thereby makes possible a highly reliable distributed computing system; advantageously, no additional Watchdog apparatus is required.
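The rotation described above can be illustrated with a small single-process simulation. This is a sketch under stated assumptions, not the patented implementation: the `Ring` and `GhostObject` classes, the `handoff` and `receive_ghost` functions, and the `faulty` set (which stands in for real timeouts and diagnostic tests) are all hypothetical. The sketch also assumes a loop of at least three processors, so that the processor asked to run the tests is never the suspect itself.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class GhostObject:
    """Hypothetical 'next Watchdog' indicator carrying Watchdog data."""
    watchdog_data: dict
    predecessor_missed: bool = False        # sender had to seize the role
    silent_processor: Optional[str] = None  # who failed to pass the token


@dataclass
class Ring:
    members: List[str]                              # loop order
    faulty: Set[str] = field(default_factory=set)   # simulation ground truth
    log: List[str] = field(default_factory=list)

    def successor(self, name: str) -> str:
        i = self.members.index(name)
        return self.members[(i + 1) % len(self.members)]


def receive_ghost(ring: Ring, receiver: str, ghost: GhostObject) -> None:
    """A flagged Ghost Object makes the receiver test the silent processor."""
    if not ghost.predecessor_missed:
        return
    suspect = ghost.silent_processor
    if suspect in ring.faulty:              # stand-in for real diagnostic tests
        ring.members.remove(suspect)        # switch it out of the loop
        ring.log.append(f"{receiver} tests {suspect}: faulty, removed")
    else:
        ring.log.append(f"{receiver} tests {suspect}: passed")


def handoff(ring: Ring, holder: str, data: dict) -> str:
    """Perform one rotation step and return the new Watchdog's name."""
    nxt = ring.successor(holder)
    if holder not in ring.faulty:
        # Normal case: the token arrives on time.
        ring.log.append(f"{holder} passes token to {nxt}")
        ghost = GhostObject(dict(data))
    else:
        # The token never arrives, so the next processor seizes the role.
        ring.log.append(f"{nxt} seizes Watchdog role ({holder} silent)")
        ghost = GhostObject(dict(data), predecessor_missed=True,
                            silent_processor=holder)
    # The new Watchdog primes the processor after it with a Ghost Object.
    receive_ghost(ring, ring.successor(nxt), ghost)
    return nxt


# Assumed scenario: B fails while it holds the Watchdog role.
ring = Ring(members=["A", "B", "C", "D"], faulty={"B"})
holder = "A"
for step in range(3):
    holder = handoff(ring, holder, {"step": step})
print("\n".join(ring.log))
```

In this scenario C seizes the role when B stays silent, D receives the flagged Ghost Object and tests B, and B is dropped from the loop, after which rotation continues among A, C and D.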
REFERENCES:
patent: 4777591 (1988-10-01), Chang et al.
patent: 5247694 (1993-09-01), Dahl
patent: 5491803 (1996-02-01), Herrmann et al.
patent: 5953510 (1999-09-01), Herzl et al.
patent: 5999976 (1999-12-01), Schmuck et al.
Lucent Technologies - Inc.
Ray Gopal C.
Werner Ulrich