Reexamination Certificate
2001-05-14
2004-08-31
Channavajjala, Srirama (Department: 2177)
Data processing: database and file management or data structures
Database design
Data structure types
C707S793000, C707S793000, C707S793000, C709S226000, C709S223000, C709S219000, C709S224000, C709S242000, C714S002000, C714S004110, C714S025000
active
06785678
ABSTRACT:
BACKGROUND OF THE INVENTION
1. The Field of the Invention
This invention relates to computer clustering systems and in particular to methods for improving the availability and reliability of computer clustering system resources and data in the event of loss of communication between computer clustering system servers.
2. Description of Related Art
A typical computer cluster includes two or more servers and one or more network devices in communication with each other across a computer network. During normal operation of a computer cluster, the servers provide the network devices with computer resources and a place to store and retrieve data. In current computer cluster configurations the computer cluster data is stored on a shared computer disk that is accessed by any of the network servers.
A typical computer cluster is illustrated in FIG. 1, which illustrates two network servers 110 and 120 in communication with network devices 130, 140, and 150 across computer network 101. Both network server 110 and network server 120 communicate with shared disk 104 across communication lines 105 and 106, respectively.
When using a computer cluster, it is often desirable to provide continuous availability of computer cluster resources, particularly where a computer cluster supports a number of user workstations, personal computers, or other network client devices. It is also often desirable to maintain uniform data between different file servers attached to a computer clustering system and maintain continuous availability of this data to client devices. To achieve reliable availability of computer cluster resources and data it is necessary for the computer cluster to be tolerant of software and hardware problems or faults. This is generally accomplished by having redundant computers and mass storage devices, such that a backup computer or disk drive is immediately available to take over in the event of a fault.
A technique currently used for implementing reliable availability of computer cluster resources and data using a shared disk configuration as shown in FIG. 1 involves the concept of quorum, which relates to a state in which one network server controls a specified minimum number of network devices such that the network server has the right to control the availability of computer resources and data in the event of a disruption of service from any other network server. The manner in which a particular network server obtains quorum can be conveniently described in terms of each server and other network devices casting “votes”. For instance, in the two server computer cluster configuration of FIG. 1, network server 110 and network server 120 each casts one vote to determine which network server has quorum. If neither network server obtains a majority of the votes, shared disk 104 then casts a vote such that one of the two network servers 110 and 120 obtains a majority, with the result that quorum is obtained by one of the network servers in a mutually understood and acceptable manner. Only one network server has quorum at any time, which ensures that only one network server will assume control of the entire network if communication between the network servers 110 and 120 is lost.
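The vote counting described above can be illustrated with a minimal sketch. It is not taken from the patent: the function name `quorum_holder`, the string identifiers for the servers, and the assumption that each server casts its vote for itself are all hypothetical choices made for illustration.

```python
from collections import Counter
from typing import Dict, Optional


def quorum_holder(server_votes: Dict[str, str],
                  disk_vote: Optional[str] = None) -> Optional[str]:
    """Return the server holding a strict majority of the votes cast, or None.

    server_votes maps each voting server to the server it votes for
    (assumed here: each server votes for itself).
    disk_vote is the optional tie-breaking vote cast by the shared disk.
    """
    tally = Counter(server_votes.values())
    if disk_vote is not None:
        tally[disk_vote] += 1
    total = sum(tally.values())
    for server, count in tally.items():
        if count * 2 > total:  # strict majority of all votes cast
            return server
    return None


# Servers 110 and 120 each cast one vote: no majority, so no quorum yet.
votes = {"110": "110", "120": "120"}
tied = quorum_holder(votes)

# Shared disk 104 then casts the deciding vote (here, for server 110).
winner = quorum_holder(votes, disk_vote="110")
```

With two servers the tally is one vote each, so neither has a majority until the disk votes, at which point exactly one server holds two of the three votes.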
The use of quorum to attempt to make network servers available in the event of a disruption will now be described. There are two general reasons for which server 110 can detect a loss of communication with server 120. The first is an event, such as a crash, at server 120, in which server 120 is no longer capable of providing network resources to clients. The second is a disruption in the communication infrastructure of network 101 between the two servers, with server 120 continuing to be capable of operating within the network. If server 110 can no longer communicate with server 120, its initial operation is to determine if it has quorum. If server 110 determines that it does not have quorum, it then attempts to get quorum by sending a command to shared disk 104 requesting the disk to cast a vote. If shared disk 104 does not vote for server 110, this server shuts itself down to avoid operating independently of server 120. In this case, server 110 assumes that network server 120 is operating with quorum and server 120 continues to control the computer cluster. However, if shared disk 104 votes for network server 110, this server takes quorum and control of the computer cluster and continues operation under the assumption that network server 120 has malfunctioned.
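The sequence followed by server 110 on loss of communication can be sketched as a small decision function. This is an illustrative sketch only: the names `Action` and `on_communication_loss` and the callback `request_disk_vote` are hypothetical, and the branch in which the server already holds quorum is assumed to mean that it simply continues, which the description implies but does not spell out.

```python
from enum import Enum
from typing import Callable


class Action(Enum):
    CONTINUE = "continue operating with quorum"
    SHUT_DOWN = "shut down"


def on_communication_loss(has_quorum: bool,
                          request_disk_vote: Callable[[], bool]) -> Action:
    """Decide what a server does after losing contact with its peer.

    request_disk_vote stands in for the command sent to the shared disk;
    it returns True if the disk votes for this server.
    """
    if has_quorum:
        # Assumption: a server that already holds quorum keeps running.
        return Action.CONTINUE
    if request_disk_vote():
        # The disk voted for this server: take quorum and control.
        return Action.CONTINUE
    # The disk did not vote for this server: stop to avoid operating
    # independently of the peer, which is presumed to hold quorum.
    return Action.SHUT_DOWN


granted = on_communication_loss(False, lambda: True)
refused = on_communication_loss(False, lambda: False)
```

Note that every outcome depends on the shared disk being reachable, which is the single point of failure the following paragraph criticizes.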
While the use of quorum to enable one of a plurality of network servers to continue providing network resources in the event of a disruption in the network is often satisfactory, the use of a shared disk places the entire network and the data stored on the disk at risk of being lost. For instance, if the shared disk 104, rather than one of the network servers 110 and 120, malfunctions, neither of the servers can operate, and the data may be permanently lost. Moreover, in a shared disk configuration the computer cluster servers are typically placed in close proximity to each other. This creates the possibility that natural disasters or power failures may take down the whole computer cluster.
SUMMARY OF THE INVENTION
The present invention relates to a method for improving the availability and reliability of computer cluster resources and data in a computer clustering system. Two servers each having an associated disk communicate across a computer network. Each server is capable of providing computer cluster resources and accessing computer cluster data for all network devices attached to the computer network. In the event of loss of communication, each server has the ability to determine the reason for loss of communication and determine whether or not it should continue operation.
When a network server detects that communication with another network server is lost, the loss in communication can be due to either a failure of the communication link or a failure of the other network server. Because each network server has a mirrored copy of the network data, a loss in communication is followed by execution of a series of acts at each network server that remains operating to ensure that the network servers do not begin operating independently of each other. In the absence of these acts, multiple network servers operating independently of one another could exist in an undesirable “split brain” mode, in which data mirroring between the network servers is not performed, thereby resulting in potential data corruption.
When operation of the computer cluster is initiated, one server is assigned control of the computer cluster resources and data and is given a “right to survive” in the event that communication between the network servers is lost as a result of failure of the communication link. For convenience, the one network server that has the “right to survive” during normal operation is designated herein as a “primary” server and any server that does not have the right to survive during normal operation is designated as a “secondary” server. It is noted that the terms “primary” and “secondary” do not connote relative importance of the servers, nor do they refer to which server is primarily responsible for providing network resources to network devices. Under normal operation, primary and secondary servers can be interchangeable from the standpoint of providing network resources. The right to survive is used in a default protocol to ensure that the split brain problem does not arise if communication between network servers is lost.
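The designation of servers at start-up can be sketched as follows. This is a hypothetical illustration, not the claimed method: the function `assign_roles` and the server names `server_a` and `server_b` are invented for the example, and the sketch covers only the labeling step, not what each server does once communication is lost.

```python
from typing import Dict, Iterable

PRIMARY = "primary"      # holds the "right to survive"
SECONDARY = "secondary"  # does not hold the "right to survive"


def assign_roles(servers: Iterable[str], survivor: str) -> Dict[str, str]:
    """Label exactly one server as primary and every other as secondary."""
    names = list(servers)
    if survivor not in names:
        raise ValueError("the surviving server must be a cluster member")
    return {name: (PRIMARY if name == survivor else SECONDARY)
            for name in names}


roles = assign_roles(["server_a", "server_b"], survivor="server_a")
```

The labels record only which server holds the right to survive; as the text notes, they say nothing about which server does more of the work during normal operation.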
When a primary network server detects loss of communication, the primary network server can continue operating, since it can assume that the other, secondary network server has failed or that the secondary network server will not continue operation. The series of acts performed by a secondary network server upon detecting loss of communication is somewhat more complex. Rather than simply ceasing operation, the secondary network server infers or determines w
Channavajjala Srirama
EMC Corporation
Lu Kuen S.
Workman Nydegger