Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability
Reexamination Certificate
2000-11-07
2004-05-25
Beausoliel, Robert (Department: 2184)
Error detection/correction and fault detection/recovery
Data processing system error or fault handling
Reliability and availability
C710S200000
Reexamination Certificate
active
06742135
ABSTRACT:
FIELD OF THE INVENTION
The present invention relates generally to a fault tolerant locking mechanism and to the use of such a mechanism in preventing deadlocks in multiprocessor computer systems.
BACKGROUND OF THE INVENTION
As used herein, the term computer includes any device or machine capable of accepting data, applying prescribed processes to the data, and supplying the results of the processes. A multiprocessing computer system has multiple processes executing on the system. Each process performs a particular task, and the processes, taken as a whole, perform some larger task, typically called an application. These processes may be executing on a single central computer or they may be running on separate computers which are connected to each other via some type of communications link, i.e., a distributed or networked computer system.
In multiprocessing systems, resources are often shared among the executing processes. Such resources may include, for example, disk drives, printers, shared memory and databases. During processing, a process may require exclusive access to a resource, such that another process may not use that resource until the first process is finished with it. Thus, several processes may compete for a finite number of resources. This is commonly known as mutually exclusive sharing of resources.
A problem with mutually exclusive access to resources is the possibility that the computer system will enter a deadlock state. A deadlock state is a state in the computer system in which, because of a resource allocation pattern, the computer system cannot progress past a processing point. For example, consider a computer system with two resources R
1
and R
2
, and two processors P
1
and P
2
, where P
1
and P
2
both need simultaneous exclusive access to R
1
and R
2
at some point in order to successfully complete their processing. If P
1
gains exclusive access to R
1
and R
2
, then P
2
must wait until P
1
releases the resources. In this situation, P
2
is described as being in a “wait” state. Such a situation does not present a problem, because, it is presumed, that P
1
will eventually release the resources, at which time P
2
may gain access to the resources. However, consider the situation in which P
1
gains access to and holds R
1
, and P
2
gains access to and holds R
2
, as shown in FIG.
1
. If this occurs, then P
1
cannot finish its task until it gains access to R
2
, and P
2
cannot finish its task until it gains access to R
1
, i.e., both P
1
and P
2
will enter a wait state. However, P
1
is holding R
1
and will not release R
1
until it gains access to R
2
, and P
2
is holding R
2
and will not release R
2
until it gains access to R
1
. At this point, the system is in a deadlock state.
Systems and mechanisms for deadlock avoidance and recovery are known in the art and described in U.S. Pat. No. 5,664,088 to Romanovsky et al. and U.S. Pat. No. 5,913,060 to Discavage.
Generally, in the prior art, once a deadlock is detected, one of the processes involved in the deadlock is terminated, so that it releases the resource it held, and the resource can be reclaimed by the system. The reclaimed resource may then be used by a waiting process. If the waiting process can finish processing using the reclaimed resource, then the system can progress past the deadlock state. The terminated process, called the victim, is generally selected on a random basis, or based on a static priority assigned to the processes.
In addition to the “circular” deadlock situation shown in
FIG. 1
, a deadlock may also be caused by processors failing, or “crashing,” before they release a resource. For example, in the same two processor system described above, assume processor P
1
holds resource R
1
, and processor P
2
holds resource R
2
and is waiting for R
1
, as shown in FIG.
2
. If processor P
1
then unexpectedly crashes while still holding resource R
1
, P
2
remains in a wait state, aware that R
1
is busy, but not aware that P
1
has crashed. Thus, P
2
will wait forever for R
1
, which is being held by crashed processor P
1
, causing a deadlock state. Since prior art deadlock avoidance schemes are based on the assumption that processors will never malfunction, these schemes will not prevent deadlocks caused by crashing processors.
In any distributed or parallel computer system, access to shared resources is controlled by some form of a locking mechanism or scheme, whereby a shared resource is committed to, or “locked” by, a single holding processor until that processor releases the resource (i.e., releases the “lock”). To make such a system fault-tolerant, when a lock holding processor (i.e., a processor that has locked a resource) unexpectedly crashes, a rescuing method is needed to prevent system deadlock. The rescue operation will inherently be determined by the locking scheme. Problems may arise when two or more processors try to rescue the same locked resource. The correct behavior is that exactly one of them would succeed. Thus, there is a need for a generic fault-tolerant locking mechanism or scheme that can avoid deadlocks caused by failing processors in a multiprocessor system, especially mission-critical parallel systems where high availability is absolutely necessary. Such a mechanism will enable waiting processors to identify a locked resource held by a failed processor, and “rescue” the resource from the hold of its failed processor, by changing, or “re-setting,” a lock associated with the resource. A single waiting processor will then rescue a resource to prevent the potential deadlock from locking up all other processors waiting for the same resource.
SUMMARY OF THE INVENTION
In a preferred embodiment, the present invention is a match-and-set lock for controlling access to a resource that is shared among a plurality of users N. The lock has a locked operating state and an unlocked operating state controlled by a value C such that the lock is in its locked operating state when C≠0 and in its unlocked operating state where C=0. The lock returns a value R, equal to the lock's current content C, to an inquiring user seeking access to the resource. A return value R=0 usually denotes that the resource is free, and a return value R≠0 denotes that the resource is locked by another user. The lock is responsive to an atomic command in the form (A, B), such that the lock substitutes B for C if A=C. Thus, the lock may be obtained/locked by issuing the command (A, B) where A=C and B≠0; and the lock may be released by issuing the command (A, B) where A=C and B=0.
In accordance with the invention, a deadlock condition may be avoided by setting the lock to the value B=P+T*(N+1). Here N is the total number of users, P is an integer within [1, N] that identifies the current user issuing the command (A, B), and T is the current global time stamp. When the lock is set in this fashion, a return value of R≠0, identifies the user currently locking the resource (via P′=R mod (N+1)) and the time when that user locked the resource (via T′=R/(N+1)). If the inquiring user determines that the user currently locking the resource has failed or restarted since locking the resource, this inquiring user can reset the lock by issuing the command (R, B). Here B=P+T*(N+1), P identifies this new user issuing the command, and T is the current time stamp; in so doing, the inquiring user gains access to the resource. If there are multiple users trying to issue this type of command (R, B) (of course, with different values of B for each inquiring user), then exactly one of them will succeed. This is because the very first of them that has succeeded will have changed the content of the lock to something different from R′ causing all others to fail. When the inquiring user is finished with the resource, the user can reset the current content of the lock to C=0, to signal to other users that the resource is again free. The plurality of users m
AT&T Corp.
Chu Gabriel L.
LandOfFree
Fault-tolerant match-and-set locking mechanism for... does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Fault-tolerant match-and-set locking mechanism for..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Fault-tolerant match-and-set locking mechanism for... will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-3242353