Method and apparatus for reliable disk fencing in a...

Electrical computers and digital processing systems: support – Multiple computer communication using cryptography – Protection at a particular protocol layer

Reexamination Certificate


Details

C714S005110, C714S006130

active

06243814

ABSTRACT:

The present invention relates to a system for reliable disk fencing of shared disks in a multicomputer system, e.g. a cluster, wherein multiple computers (nodes) have concurrent access to the shared disks. In particular, the system is directed to a high availability system with shared access disks.
BACKGROUND OF THE INVENTION
In clustered computer systems, a given node may “fail”, i.e. be unavailable according to some predefined criteria which are followed by the other nodes. Typically, for instance, the given node may have failed to respond to a request in less than some predetermined amount of time. Thus, a node that is executing unusually slowly may be considered to have failed, and the other nodes will respond accordingly.
When a node (or more than one node) fails, the remaining nodes must perform a system reconfiguration to remove the failed node(s) from the system, and the remaining nodes preferably then provide the services that the failed node(s) had been providing.
It is important to isolate the failed node from any shared disks as quickly as possible. Otherwise, if the failed (or slowly executing) node is not isolated by the time system reconfiguration is complete, then it could, e.g., continue to make read and write requests to the shared disks, thereby corrupting data on the shared disks.
Disk fencing protocols have been developed to address this type of problem. For instance, in the VAXcluster system, a “deadman brake” mechanism is used. See Davis, R. J., VAXcluster Principles (Digital Press 1993), incorporated herein by reference. In the VAXcluster system, a failed node is isolated from the new configuration, and the nodes in the new configuration are required to wait a certain predetermined timeout period before they are allowed to access the disks. The deadman brake mechanism on the isolated node guarantees that the isolated node becomes “idle” by the end of the timeout period.
The deadman brake mechanism on the isolated node in the VAXcluster system involves both hardware and software. The software on the isolated node is required to periodically tell the cluster interconnect adaptor (CI), which is coupled between the shared disks and the cluster interconnect, that the node is “sane”. The software can detect in a bounded time that the node is not a part of the new configuration. If this condition is detected, the software will block any disk I/O, thus setting up a software “fence” preventing any access of the shared disks by the failed node. A disadvantage presented by the software fence is that the software must be reliable; failure of (or a bug in) the “fence” software results in failure to block access of the shared disks by the ostensibly isolated node.
If the software executes too slowly and thus does not set up the software fence in a timely fashion, the CI hardware shuts off the node from the interconnect, thereby setting up a hardware fence, i.e. a hardware obstacle disallowing the failed node from accessing the shared disks. This hardware fence is implemented through a sanity timer on the CI host adaptor. The software must periodically tell the CI hardware that the software is “sane”. A failure to do so within a certain time-out period will trigger the sanity timer in CI. This is the “deadman brake” mechanism.
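The sanity-timer behavior described above can be sketched as a toy model. This is only an illustration under stated assumptions: the class name `SanityTimer`, its methods, and the use of an explicit `now` argument in place of a real hardware clock are all hypothetical and are not taken from the VAXcluster design.

```python
class SanityTimer:
    """Toy model of a sanity timer on an interconnect adaptor (hypothetical names)."""

    def __init__(self, timeout: float, now: float = 0.0):
        self.timeout = timeout      # maximum allowed gap between "sane" reports
        self.last_ping = now        # time of the most recent "sane" report
        self.fenced = False         # True once the hardware fence is up

    def _check(self, now: float) -> None:
        # The timer fires if the software stayed silent for longer than the timeout.
        if now - self.last_ping > self.timeout:
            self.fenced = True

    def ping(self, now: float) -> None:
        # Host software reports that it is still "sane".
        self._check(now)
        if not self.fenced:
            self.last_ping = now

    def allow_io(self, now: float) -> bool:
        # Disk I/O passes only while the node has not been shut off.
        self._check(now)
        return not self.fenced


timer = SanityTimer(timeout=5.0)
timer.ping(now=3.0)
early = timer.allow_io(now=7.0)   # 4 time units since the last report
late = timer.allow_io(now=9.0)    # 6 time units since the last report
timer.ping(now=10.0)              # a late report does not lift the fence
after = timer.allow_io(now=10.0)
```

In this sketch the fence, once set, stays in place; whether and how a real adaptor would be re-enabled is outside what the text describes.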
Other disadvantages of this node isolation system are that:
it requires an interconnect adaptor utilizing an internal timer to implement the hardware fence.
the solution does not work if the interconnect between the nodes and disks includes switches or any other buffering devices. A disk request from an isolated node could otherwise be delayed by such a switch or buffer, and sent to the disk after the new configuration is already accessing the disks. Such a delayed request would corrupt files or databases.
depending on the various time-out values, the time that the members of the new configuration have to wait before they can access the disk may be too long, resulting in decreased performance of the entire system and contrary to high-availability principles.
From an architectural level perspective, a serious disadvantage of the foregoing node isolation methodology is that it does not have end-to-end properties; the fence is set up on the node rather than on the disk controller.
It would be advantageous to have a system that presented high availability while rapidly setting up isolation of failed disks at the disk controller.
Other UNIX-based clustered systems use SCSI (small computer systems interface) “disk reservation” to prevent undesired subsets of clustered nodes from accessing shared disks. See, e.g., the ANSI SCSI-2 Proposed Standard for information systems (Mar. 9, 1990, distributed by Global Engineering Documents), which is incorporated herein by reference. Disk reservation has a number of disadvantages; for instance, the disk reservation protocol is applicable only to systems having two nodes, since only one node can reserve a disk at a time (i.e. no other nodes can access that disk at the same time). Another is that in a SCSI system, the SCSI bus reset operation removes any disk reservations, and it is possible for the software disk drivers to issue a SCSI bus reset at any time. Therefore, SCSI disk reservation is not a reliable disk fencing technique.
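The two weaknesses of disk reservation named above (a single reservation holder, and loss of the reservation on a bus reset) can be illustrated with a toy model. The class `ReservableDisk` and its methods are hypothetical and do not model the actual SCSI command set; they only mirror the behavior described in the paragraph.

```python
class ReservableDisk:
    """Toy model of a disk with a single-holder reservation (not real SCSI)."""

    def __init__(self):
        self.reserved_by = None  # name of the node holding the reservation, if any

    def reserve(self, node: str) -> bool:
        # Only one node can hold the reservation at a time.
        if self.reserved_by in (None, node):
            self.reserved_by = node
            return True
        return False

    def bus_reset(self) -> None:
        # As described in the text, a bus reset removes any reservation.
        self.reserved_by = None

    def write(self, node: str) -> bool:
        # A write succeeds if the disk is unreserved or reserved by this node.
        return self.reserved_by in (None, node)


disk = ReservableDisk()
disk.reserve("A")
second_holder = disk.reserve("B")   # a second node cannot share the reservation
blocked = disk.write("B")           # node B is fenced while A holds the disk
disk.bus_reset()                    # any driver may trigger this at any time
leaked = disk.write("B")            # the fence is gone without A's knowledge
```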
Another node isolation methodology involves a “poison pill”; when a node is removed from the system during reconfiguration, one of the remaining nodes sends a “poison pill”, i.e. a request to shut down, to the failed node. If the failed node is in an active state (e.g. executing slowly), it takes the pill and becomes idle within some predetermined time.
The poison pill is processed either by the host adaptor card of the failed node, or by an interrupt handler on the failed node. If it is processed by the host adaptor card, the disadvantage is that the system requires a specially designed host adaptor card to implement the methodology. If it is processed by an interrupt handler on the failed node, the disadvantage is that the node isolation is not reliable; for instance, as with the VAXcluster discussed above, the software at the node may itself be unreliable, time-out delays are introduced, and again the isolation is at the node rather than at the shared disks.
A system is therefore needed that prevents shared disk access at the disk sites, using a mechanism that both rapidly and reliably blocks an isolated node from accessing the shared disks, and does not rely upon the isolated node itself to support the disk access prevention.
SUMMARY OF THE INVENTION
The present invention utilizes a method and apparatus for quickly and reliably isolating failed resources, including I/O devices such as shared disks, and is applicable to virtually any shared resource on a computer system or network. The system of the invention maintains a membership list of all the active shared resources, and with each new configuration, such as when a resource is added or fails (and thus should be functionally removed), the system generates a new epoch number or other value that uniquely identifies that configuration at that time. Thus, identical memberships occurring at different times will have different epoch numbers, particularly if a different membership set has occurred in between.
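As a minimal sketch of the membership and epoch idea, the following assumes the epoch number is a simple counter that is incremented on every reconfiguration. The text only requires a value that uniquely identifies the configuration at that time, so the counter, the class name `Membership` and the method `reconfigure` are illustrative assumptions.

```python
class Membership:
    """Tracks the active members and an epoch number (hypothetical sketch)."""

    def __init__(self, members):
        self.epoch = 0
        self.members = frozenset(members)

    def reconfigure(self, members) -> int:
        # Every new configuration gets a fresh epoch number, even if the
        # resulting membership set has been seen before.
        self.epoch += 1
        self.members = frozenset(members)
        return self.epoch


m = Membership({"A", "B", "C"})
first_epoch, first_members = m.epoch, m.members

e1 = m.reconfigure({"A", "B"})        # C fails and is removed
e2 = m.reconfigure({"A", "B", "C"})   # C rejoins

# Same membership as at the start, but a different epoch number.
same_members = (m.members == first_members)
same_epoch = (m.epoch == first_epoch)
```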
Each time a new epoch number is generated, a control key value is derived from it and is sent to the nodes in the system, each of which stores the control key locally as its own node key. The controllers for the resources (such as disk controllers) also store the control key locally. Thereafter, whenever a shared resource access request is sent to a resource controller, the node key is sent with it. The controller then checks whether the node key matches the controller's stored version of the control key, and allows the resource access request only if the two keys match.
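The key comparison at the resource controller can be sketched as follows. The excerpt does not say how the control key is derived from the epoch number, so this sketch assumes the key is simply the epoch number itself; `derive_control_key`, `DiskController` and `Node` are hypothetical names. The scenario in which a removed node keeps a stale key is an illustrative reading of the text, not a statement of the claimed method.

```python
def derive_control_key(epoch: int) -> int:
    # Assumption: the control key is just the epoch number.
    return epoch


class DiskController:
    """Stores the control key and checks it on every access request."""

    def __init__(self, control_key: int):
        self.control_key = control_key
        self.blocks = {}

    def write(self, node_key: int, block: int, data: str) -> bool:
        # Allow the request only if the node key matches the stored control key.
        if node_key != self.control_key:
            return False
        self.blocks[block] = data
        return True


class Node:
    """A cluster node that sends its locally stored node key with each request."""

    def __init__(self, name: str, node_key: int):
        self.name = name
        self.node_key = node_key

    def write(self, controller: DiskController, block: int, data: str) -> bool:
        return controller.write(self.node_key, block, data)


# Epoch 1: the key is stored by both nodes and by the controller.
key1 = derive_control_key(1)
controller = DiskController(key1)
a, b = Node("A", key1), Node("B", key1)
ok_a = a.write(controller, 0, "from A")
ok_b = b.write(controller, 1, "from B")

# Node B is removed: a new epoch yields a new key for A and the controller only.
key2 = derive_control_key(2)
controller.control_key = key2
a.node_key = key2
stale_b = b.write(controller, 1, "late write from B")  # rejected at the controller
fresh_a = a.write(controller, 0, "from A, epoch 2")
```

Because the check runs at the controller, the rejection in this sketch does not depend on any cooperation from node B.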
When a resource fails, e.g. does not respond to a request within some predetermined period of time (indicating a possible hardware or software defect), the membership of the system is determined anew, eliminating the failed resource. A new epoch number is generated, an
