Cluster node distress signal

Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability

Reexamination Certificate

Details

C709S220000

Reexamination Certificate

active

06442713

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Technical Field
This invention generally relates to clustering computers, and more specifically relates to distress signaling for cluster communications.
2. Background Art
Society depends upon computer systems for many types of information in this electronic age. Based upon various combinations of hardware (e.g., semiconductors, circuit boards, etc.) and software (e.g., computer programs), computer systems vary widely in design. Many computer systems today are designed to “network” with other computer systems. Through networking, a single computer system can access information stored on and processed by other computer systems. Thus, networking results in greater numbers of computer systems having access to greater numbers of electronic resources.
Networking is made possible by physical “routes” between computer systems, and the use of agreed upon communications “protocols.” Which protocol is chosen depends upon factors including the number of networked computer systems, the distances separating the computer systems, and the purposes of information exchange between the computer systems. Communications protocols can be very simple if only a few computer systems are networked together in close proximity. However, these communications protocols become more sophisticated as greater numbers of computer systems are added, and as computer systems are separated by greater distances.
The sophistication of communications protocols also varies with the type of information exchange. For instance, some protocols emphasize accuracy in sending large amounts of information, while others emphasize the speed of information transfer. The communications requirements of the applications running on a computer system network determine what type of protocol is chosen. An example of a computer application requiring real-time, reliable information transfer is a “cluster” management application.
Clustering is the networking of computer systems for the purpose of providing continuous resource availability and for sharing workload. A cluster of computer systems appears as one computer system from a computer system user's perspective, but actually is a network of computer systems backing each other up. In the event of an overload or failure on one computer system in a cluster, cluster management applications automatically reassign processing responsibilities for the failing computer system to another computer system in the cluster. Thus, from a user's perspective there is no interruption in the availability of resources.
Typically, one node in the cluster is assigned primary responsibility for an application (e.g., database, server) and other nodes are assigned backup responsibility. When the primary node for an application fails, the backup nodes in the cluster take over responsibility for that application. This ensures the high availability of that application.
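The primary and backup arrangement described above can be illustrated with a minimal Python sketch. The names `Application` and `fail_over`, and the rule of promoting the first backup in the list, are assumptions made for illustration; the source does not specify how a backup node is selected.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Application:
    """An application with one primary node and an ordered list of backups."""
    name: str
    primary: str
    backups: List[str] = field(default_factory=list)


def fail_over(app: Application, failed_node: str) -> str:
    """Update responsibilities after failed_node fails; return the primary."""
    if app.primary != failed_node:
        # A failed backup is simply removed from the backup list.
        app.backups = [n for n in app.backups if n != failed_node]
        return app.primary
    if not app.backups:
        raise RuntimeError(f"no backup node left for {app.name}")
    # Assumption: the first listed backup is promoted to primary.
    app.primary = app.backups.pop(0)
    return app.primary
```

For example, an application with primary `node_a` and backups `node_b` and `node_c` would, under this assumed rule, end up with `node_b` as primary once `node_a` is reported as failed.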
Clustering is made possible through cluster management application programs running on each computer system in a cluster. These applications relay cluster messages back and forth across the cluster network to control cluster activities. Cluster messaging is also used to distribute updates about which computer systems in the cluster have which primary and backup responsibilities.
To ensure the high availability of applications running on the cluster, the cluster needs to be able to keep track of the status of all the nodes on a cluster. To do this, each computer system in a cluster continuously monitors each of the other computer systems in the same cluster to ensure that each is alive and performing the processing assigned to it. Thus, when a node on a cluster fails, its primary responsibilities can be assigned to the backup nodes.
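One common way to realize this kind of mutual monitoring is a heartbeat table with a timeout. The sketch below is illustrative only: the class name `HeartbeatMonitor`, the timeout value, and the use of explicit timestamps are assumptions, and the source does not say that the cluster uses this mechanism.

```python
from typing import Dict, Iterable, List


class HeartbeatMonitor:
    """Tracks when each peer node was last heard from."""

    def __init__(self, peers: Iterable[str], timeout: float) -> None:
        self.timeout = timeout
        # Time of the most recent heartbeat from each peer (0.0 = never).
        self.last_seen: Dict[str, float] = {peer: 0.0 for peer in peers}

    def record_heartbeat(self, peer: str, now: float) -> None:
        """Note that a heartbeat from peer arrived at time now."""
        self.last_seen[peer] = now

    def silent_peers(self, now: float) -> List[str]:
        """Return peers whose last heartbeat is older than the timeout."""
        return sorted(
            peer for peer, seen in self.last_seen.items()
            if now - seen > self.timeout
        )
```

A peer returned by `silent_peers` is only known to be silent. The monitor alone cannot say why the peer stopped communicating.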
Unfortunately, it is not always possible to tell that a node in the cluster has failed. For example, if the network connection between one node and the rest of the cluster fails, the cluster will no longer be able to tell if that node is operating properly. If a node is still operating but its network connection to other nodes in the cluster has failed, then the node is said to have been “partitioned” from the cluster. When a node unexpectedly stops communicating with the rest of the cluster, it cannot be easily determined whether the node has failed or has merely been partitioned from the rest of the cluster. If the cluster incorrectly assumes the node has failed, and assigns the backup node primary responsibility for the application, the cluster will have two nodes both believing that they are the primary node. This can result in data inconsistencies in the database as both nodes respond to requests to the cluster. If, on the other hand, the cluster incorrectly assumes the node is still performing its primary applications and has only been partitioned from the cluster, and does not assign primary responsibility to the backup node, then those applications will no longer be available to the clients of the cluster. Thus, in many cases the cluster is unable to correctly respond to a non-communicating node without manual intervention by administrators.
As more resources become accessible across computer system networks, the demand for continuous access to such network resources will grow. The demand for clusters as a means to provide continuous availability to such network resources will grow correspondingly. Without improved methods for determining the status of cluster nodes, the continuous availability of these resources will not be fully realized.
DISCLOSURE OF INVENTION
According to the present invention, a cluster node distress system is provided that improves the reliability of a cluster. The cluster node distress system provides a cluster node distress signal when a node on the cluster is about to fail. This allows the cluster to better determine whether a non-communicating node has failed or has merely been partitioned from the cluster. The preferred cluster node distress system is embedded deeply into the operating system and provides a pre-built node distress signal that can be quickly sent to other nodes in the cluster when an imminent failure of that node is detected. This improves the probability that the node distress signal will get out before the node totally fails. When the node distress signal is successfully sent to the cluster, the cluster can accurately determine that the node has failed and has not just been partitioned from the cluster. This allows the cluster to respond correctly, i.e., by assigning other nodes primary responsibility, and requires less intervention by administrators. Thus, the preferred embodiment provides improved cluster reliability and decreased reliance on administrators.
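The idea of a pre-built distress signal can be sketched in Python as follows. This is a user-level illustration under stated assumptions, not the operating-system-level embodiment the text describes: the JSON message layout, the `DistressSender` class, the `send` callback, and the `classify_silent_node` helper are all hypothetical names introduced here.

```python
import json
from typing import Callable, Set


class DistressSender:
    """Holds a distress message that is built once, ahead of any failure."""

    def __init__(self, node_id: str, send: Callable[[bytes], None]) -> None:
        self._send = send
        # Build the message up front so that nothing has to be constructed
        # at failure time (hypothetical message format).
        self._message = json.dumps(
            {"type": "DISTRESS", "node": node_id}
        ).encode("utf-8")

    def on_imminent_failure(self) -> None:
        """Send the pre-built message as the only step on the failure path."""
        self._send(self._message)


def classify_silent_node(node_id: str, distress_received: Set[str]) -> str:
    """Decide what the cluster can conclude about a silent node."""
    if node_id in distress_received:
        # A distress signal arrived, so the node is known to have failed.
        return "failed"
    # Without a signal, failure and partition cannot be told apart.
    return "failed_or_partitioned"
```

With this sketch, a silent node whose distress message was received is classified as `failed` and its responsibilities can be reassigned, while a silent node with no distress message remains ambiguous.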
The foregoing and other features and advantages of the invention will be apparent from the following more particular description as set forth in the preferred embodiments of the invention, and as illustrated in the accompanying drawings.


