Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability
Reexamination Certificate
2000-08-25
2004-03-09
Beausoliel, Robert (Department: 2184)
active
06704884
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to the field of fault-tolerant computer systems. More specifically, the present invention relates to the problem of assigning identifiers to control blocks or other objects used by the software processes that share a common interface when any of these software processes may independently fail over to a redundant backup copy of the software process. Ideally, such identifiers should be the same in primary and backup copies of a software process so that they do not change when a fail over occurs.
2. Description of Prior Art
Fault-tolerant computer systems use a variety of techniques to provide highly-available systems for use in safety-critical or mission-critical environments. Many different approaches have been taken by different organizations to achieve fault-tolerance.
One approach to fault-tolerance is to use specialized hardware and operating systems to mirror all inputs to a number of redundant processing units. Outputs from the system are taken from just one processing unit, called the primary, until it is determined to have failed and another processing unit is selected as the primary. Another approach is to take a majority vote for the correct output and to disable any processing unit that disagrees with this output, on the assumption that it has failed. For further details of this approach to fault-tolerance, see the following: U.S. Pat. No. 5,271,013, Gleeson; U.S. Pat. No. 5,363,503, Gleeson; U.S. Pat. No. 5,560,033, Doherty et al.; and U.S. Pat. No. 5,802,265, Bressoud et al.
An alternative approach is to provide fault-tolerance in the software process layer, which avoids the need for specialized hardware or operating system support. This approach is also more easily deployed on a cluster of heterogeneous processing units with different hardware characteristics, since it does not rely on specific attributes of the hardware. Software fault-tolerance, as this approach is commonly called, typically uses a combination of redundant backup software processes and replication of internal state between the primary and backup copies of each software process to speed recovery from any software or hardware faults. However, many practical fault-tolerant systems combine both hardware and software fault-tolerance techniques. For further details of the general techniques used to achieve software fault-tolerance, see the following: U.S. Pat. No. 5,129,080, Smith; and U.S. Pat. No. 5,748,882, Huang. See also the following publications: Hardware and Software Architectures for Fault Tolerance, Chapter 3, ed. Banatre et al., Springer-Verlag 1994; Fault Tolerance in Distributed Systems, Jalote, Chapter 5, Prentice Hall 1994; and Fault-Tolerant Computer System Design, Chapter 7, Pradhan, Prentice Hall 1996.
Since software fault-tolerance does not have the benefit of hardware assistance, the performance of systems employing software fault-tolerance can be an issue in some environments. In particular, a potential performance bottleneck is the addressing of control blocks or other objects used on an interface between two software processes, known as “partner” software processes, each of which can fail over independently to a backup software process. This situation is further complicated if the backup processes may be running on a different processing unit, possibly employing a different processor architecture. Standard techniques that are commonly used to identify control blocks and objects on software interfaces all suffer from some disadvantages in a distributed, heterogeneous fault-tolerant system.
Use of names for the control blocks necessitates a search of all names on each request, and hence gives poor performance.
Use of memory addresses, or “pointers” as they are commonly known, gives good performance until one partner software process fails over to a backup. After fail over, however, a resynchronization phase between the partner software processes is required in order to exchange replacement addresses, because the backup that has taken over as primary may not have allocated the control blocks at the same addresses as the failed primary. This resynchronization requires the use of some form of name-based search for the control blocks or objects, which is potentially a slow operation. Window conditions in the interface between software processes may also make the direct exchange of a pointer with a partner software process unsafe and prone to cause system failures.
The potential system failures caused by direct use of pointers can be avoided by using a direct index or an indirect index, known as a “handle”, taken from an extensible pool of handles. Unfortunately, the use of an index or handle still necessitates resynchronization between partner software processes after fail over, because there is no guarantee that the index or handle values assigned by the primary and backup copies of the same software process type will match.
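The handle mismatch described above can be illustrated with a minimal Python sketch. The `HandlePool` class, its stack-based free list, and the session names are assumptions made purely for illustration and are not taken from the patent; the only point is that two pools that allocate independently need not agree on handle values.

```python
class HandlePool:
    """Hypothetical handle pool: small integer handles index control blocks."""

    def __init__(self, size):
        self.slots = [None] * size
        # Free handles kept as a stack, with handle 0 on top initially.
        self.free = list(reversed(range(size)))

    def allocate(self, control_block):
        handle = self.free.pop()          # take the handle on top of the stack
        self.slots[handle] = control_block
        return handle

    def release(self, handle):
        self.slots[handle] = None
        self.free.append(handle)          # most recently freed is reused first

    def lookup(self, handle):
        return self.slots[handle]


# The primary has some allocation history before the backup is populated.
primary = HandlePool(4)
session_a = primary.allocate("session A")
session_b = primary.allocate("session B")
primary.release(session_a)
session_c = primary.allocate("session C")   # reuses the handle freed by A

# The backup allocates independently for the two surviving control blocks.
backup = HandlePool(4)
backup_b = backup.allocate("session B")
backup_c = backup.allocate("session C")

primary_handles = {"session B": session_b, "session C": session_c}
backup_handles = {"session B": backup_b, "session C": backup_c}
print(primary_handles)
print(backup_handles)
```

In this sketch a partner that still holds the primary's handle for session B would, after fail over, look up a different control block in the backup, which is why a resynchronization step is needed.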
BRIEF SUMMARY OF THE INVENTION
The present invention, known as a “replicated handle”, is a means of identifying control blocks or objects on an interface between partner software processes in a software fault-tolerant system. Replicated handles make use of a close coupling between the functions of a pool manager for indirect index values and the replication of internal state information between primary and backup copies of a software process. This allows handles to be replicated either by piggybacking on the exchanges used to replicate internal state between primary and backup copies of a software process, or as an explicit action when required.
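One way to picture this coupling is the following minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation: the `ReplicatedHandlePool` class, the `claim` operation, and the dictionary message format are all hypothetical names. The primary chooses a handle, the handle value travels inside the state-replication message, and the backup's pool manager binds that same value rather than choosing its own.

```python
class ReplicatedHandlePool:
    """Hypothetical pool manager whose handle values can be dictated by a peer."""

    def __init__(self, size):
        self.slots = [None] * size
        self.free = set(range(size))

    def allocate(self, control_block):
        """Primary side: pick a free handle (here, the lowest one)."""
        handle = min(self.free)
        self.free.remove(handle)
        self.slots[handle] = control_block
        return handle

    def claim(self, handle, control_block):
        """Backup side: bind the exact handle value chosen by the primary."""
        if handle not in self.free:
            raise ValueError(f"handle {handle} is already in use")
        self.free.remove(handle)
        self.slots[handle] = control_block

    def release(self, handle):
        self.slots[handle] = None
        self.free.add(handle)

    def lookup(self, handle):
        return self.slots[handle]


def make_state_update(handle, state):
    """Hypothetical replication message: the handle rides along with the state."""
    return {"handle": handle, "state": dict(state)}


def apply_state_update(pool, message):
    """Backup applies an update, claiming the handle the first time it is seen."""
    handle = message["handle"]
    if handle in pool.free:
        pool.claim(handle, dict(message["state"]))
    else:
        pool.lookup(handle).update(message["state"])


primary = ReplicatedHandlePool(4)
backup = ReplicatedHandlePool(4)

block_a = {"peer": "A"}
handle_a = primary.allocate(block_a)
apply_state_update(backup, make_state_update(handle_a, block_a))

block_b = {"peer": "B"}
handle_b = primary.allocate(block_b)
apply_state_update(backup, make_state_update(handle_b, block_b))

# Releases are replicated too (the message is omitted here for brevity).
primary.release(handle_a)
backup.release(handle_a)

block_c = {"peer": "C"}
handle_c = primary.allocate(block_c)
apply_state_update(backup, make_state_update(handle_c, block_c))

# After a fail over, handles held by a partner resolve identically on the backup.
print(backup.lookup(handle_b), backup.lookup(handle_c))
```

Because the handle is carried in a message that would be sent anyway to replicate state, the sketch needs no separate handle-exchange step between the primary and the backup.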
The present invention has the following advantages over prior art:
The present invention allows partner software processes to exchange identifiers for control blocks or objects with a guarantee that the identifiers do not change when one or more of the partners fails over to a redundant backup copy, while preserving efficient access to the control blocks or objects within each software process. Because the identifiers used for the control blocks or objects do not change, there is no need to resynchronize these identifiers after a fail over.
The present invention achieves the replication of handles without requiring any additional message exchanges between primary and backup copies of a software process by piggybacking on the replication of internal state that is required for software fault-tolerance.
The present invention allows the number of replicated handles available to a software process to be extended dynamically in order to cope with variations in the work load presented to the system.
The present invention is independent of the system hardware architecture or operating system, and can be used in heterogeneous distributed systems.
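The dynamic extension of the handle pool mentioned among these advantages can be sketched as follows. This is a hedged illustration in Python: the `GrowableHandlePool` class, the fixed `chunk` size, and the rule that a backup grows its pool when asked to claim a handle beyond its current size are assumptions, not details taken from the patent.

```python
class GrowableHandlePool:
    """Hypothetical handle pool that is extended in fixed-size chunks on demand."""

    def __init__(self, chunk=2):
        self.chunk = chunk
        self.slots = []
        self.free = set()

    def _grow_to(self, size):
        """Add whole chunks until the pool holds at least `size` slots."""
        while len(self.slots) < size:
            start = len(self.slots)
            self.slots.extend([None] * self.chunk)
            self.free.update(range(start, start + self.chunk))

    def allocate(self, control_block):
        """Primary side: extend the pool if no handle is free, then allocate."""
        if not self.free:
            self._grow_to(len(self.slots) + 1)
        handle = min(self.free)
        self.free.remove(handle)
        self.slots[handle] = control_block
        return handle

    def claim(self, handle, control_block):
        """Backup side: extend the pool if needed so the given handle exists."""
        self._grow_to(handle + 1)
        if handle not in self.free:
            raise ValueError(f"handle {handle} is already in use")
        self.free.remove(handle)
        self.slots[handle] = control_block

    def lookup(self, handle):
        return self.slots[handle]


# The primary's pool grows as the work load increases.
primary = GrowableHandlePool(chunk=2)
handles = [primary.allocate(f"block {i}") for i in range(5)]

# A backup that starts empty can still accept a high handle value directly.
backup = GrowableHandlePool(chunk=2)
backup.claim(handles[4], "block 4")

print(handles, len(primary.slots), len(backup.slots))
```

Growing in whole chunks keeps the primary and backup pools the same size in this sketch, even when the backup learns about handles out of order.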
Brittain Paul John
Dancer Colin Michael
Miller Benjamin Mark Simon Kenneth
Reekie David William Maxwell
Shepherd Adam Paul
Beausoliel Robert
Bonzo Bryce P.
Shepherd Adam