Method and apparatus for cluster system operation

Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability

Reexamination Certificate

Details

C709S250000, C340S315000, C370S225000

active

06502203

ABSTRACT:

FIELD OF THE INVENTION
The present invention relates to clusters of computers, and particularly to data communication between computers in a cluster system.
BACKGROUND AND SUMMARY OF THE INVENTION
A “cluster” is a collection of computers that work in concert to provide a much more powerful system. (More precisely, a cluster can be defined as a set of loosely coupled, independent computer nodes, which presents a single system image to its clients.) Clusters have the advantage that they can grow much larger than the largest single node, they can tolerate node failures and continue to offer service, and they can be built from inexpensive components. The nodes of the cluster may communicate over network interconnects which are also used to let the cluster communicate with other computers, or the cluster architecture may include a dedicated network which is only used by the nodes of the cluster. As the price of microprocessors has declined, cluster systems based on personal computers have become more attractive. The present application discloses a new cluster architecture.
Background: Particular Problems in Cluster Systems
A key necessity in cluster operations is to coordinate the operation of the members of the cluster. One very fundamental element of this is that each member of a cluster needs to know whether the cluster is still operating, and if so who the other members of the cluster are. This basic need is referred to as “quorum validation.”
A current method for quorum validation, i.e. the process of verifying that the cluster members are still present, is to send messages (called “heartbeats”) to other nodes to obtain mutual consent on the agreed-upon list of cluster members. These heartbeats include both substantive messages sent by a node and other messages which simply indicate that the sending node is still connected to the network and functioning correctly. Loss of messaging capability between any nodes (or groups of nodes) in a cluster can be detected by the loss of heartbeats from a given node. When such a loss of messaging occurs, the remaining cluster nodes will attempt to create a new member list for the cluster. This activity is called quorum negotiation.
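As an illustration only, heartbeat-based loss detection can be sketched as follows. The class name HeartbeatMonitor, the node names, and the 3-second timeout are hypothetical and chosen for the example; the source does not specify a timeout or any data structure.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HeartbeatMonitor:
    """Tracks the last heartbeat seen from each cluster member (hypothetical helper)."""
    timeout: float  # seconds of silence before a node is suspected (assumed value)
    last_seen: Dict[str, float] = field(default_factory=dict)

    def record(self, node: str, now: float) -> None:
        # Any message from a node counts as a heartbeat, substantive or not.
        self.last_seen[node] = now

    def suspects(self, now: float) -> List[str]:
        # Nodes whose most recent heartbeat is older than the timeout.
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout)


monitor = HeartbeatMonitor(timeout=3.0)
monitor.record("node-a", now=0.0)
monitor.record("node-b", now=0.0)
monitor.record("node-a", now=2.5)  # node-a keeps sending; node-b goes silent
suspected = monitor.suspects(now=4.0)
print(suspected)  # nodes whose silence would trigger quorum negotiation
```

In this sketch a non-empty result is the point at which the remaining nodes would begin quorum negotiation.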
This capability makes the cluster organization robust. However, it can also lead to a particularly acute problem. In some cases loss of messaging can leave multiple partitioned subclusters, each of which agrees on its own subset of the membership. These subclusters then operate as separate clusters, independently operating on, modifying, or creating data intended to be part of a larger data set common to the whole cluster. This is often referred to as the “split brain syndrome,” in that the cluster splits into two disjoint parts that both claim to own some system resource. Eventual rejoining of the independent partitioned clusters into one large cluster will cause the datasets to be re-synchronized, often requiring one or more sets of modified data to be discarded, resulting in a loss of data integrity and a waste of processing cycles. Other details of cluster systems may be found in many publications, including “The Design and Architecture of the Microsoft Cluster Service,” Proceedings of the FTCS 1998, which is hereby incorporated by reference.
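The partitioning described above can be illustrated with a small sketch that groups nodes by which peers they can still exchange messages with. The function name partitions, the four node names, and the link lists are assumptions made for the example and do not come from the source.

```python
def partitions(nodes, links):
    """Group nodes into subclusters that can still reach one another."""
    neighbours = {n: set() for n in nodes}
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen, groups = set(), []
    for start in nodes:
        if start in seen:
            continue
        # Depth-first walk over the surviving messaging links.
        group, stack = set(), [start]
        while stack:
            n = stack.pop()
            if n in group:
                continue
            group.add(n)
            stack.extend(neighbours[n] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


nodes = ["a", "b", "c", "d"]
healthy = [("a", "b"), ("b", "c"), ("c", "d")]
after_failure = [("a", "b"), ("c", "d")]  # the b-c link has been lost

whole = partitions(nodes, healthy)
split = partitions(nodes, after_failure)
print(whole)
print(split)
```

When more than one group is returned, each group can agree on its own membership, which is the split brain condition.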
Method and Apparatus for Cluster System Operation
The present application discloses a cluster system and a method of quorum negotiation which use communication over the power mains to provide a secondary communication channel. This secondary channel does not replace the primary channel, which is still a standard or high-speed network system. If the heartbeat is lost over the primary communication system, the secondary channel can be used to check the heartbeat to validate whether or not the “lost” system is still in operation. If communication cannot be established over the power mains, it is assumed that the “lost” system is down and should be dropped from any cluster.
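The fallback logic of this paragraph might be sketched as below. This is a minimal illustration, not the patented implementation: the Status enum, the classify function, and the two probe callables are hypothetical stand-ins for whatever mechanism actually checks the network and the power mains.

```python
from enum import Enum
from typing import Callable


class Status(Enum):
    REACHABLE = "heartbeat present on primary channel"
    PRIMARY_LOST = "primary lost, still answering on secondary channel"
    DOWN = "no answer on either channel, assumed down"


def classify(node: str,
             primary_ok: Callable[[str], bool],
             secondary_ok: Callable[[str], bool]) -> Status:
    """Check the primary channel first; consult the secondary only on loss."""
    if primary_ok(node):
        return Status.REACHABLE
    if secondary_ok(node):
        # The "lost" system is still in operation; only the network path failed.
        return Status.PRIMARY_LOST
    # No communication over the power mains either: drop from the cluster.
    return Status.DOWN


# Hypothetical probe results standing in for real channel checks.
on_network = {"node-a"}
on_mains = {"node-a", "node-b"}

results = {
    n: classify(n, on_network.__contains__, on_mains.__contains__)
    for n in ["node-a", "node-b", "node-c"]
}
for name, status in results.items():
    print(name, "->", status.value)
```

Keeping a separate PRIMARY_LOST state is a design choice of the sketch: it distinguishes a node that has only lost its network path from one that is down.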
In addition, this methodology can be used to reset components or nodes, and to guarantee that the split system is reset. This technique also overcomes the problem of determining whether the entire cluster or just member systems should be reset.


