Reexamination Certificate
2001-05-02
2004-11-23
Beausoliel, Robert (Department: 2184)
Error detection/correction and fault detection/recovery
Data processing system error or fault handling
Reliability and availability
C714S015000, C714S006130
active
06823474
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a method and system for providing cluster replicated checkpoint services. In particular, the present invention relates to a cluster replicated checkpoint service (“CRCS”), which provides services that let components maintain a checkpoint and its replicas. In so doing, the CRCS allows components to recover promptly and seamlessly from failures, and thus ensures high availability of the services they provide.
2. Discussion of the Related Art
Networked computer systems enable users to share resources and services. One computer can request and use resources or services provided by another computer. The computer requesting and using the resources or services provided by another computer is typically known as a client, and the computer providing resources or services to another computer is known as a server.
A group of independent network servers may be used to form a cluster. Servers in a cluster are organized so that they operate, and appear to clients, as if they were a single unit. A cluster and its network may be designed to improve network capacity by, among other things, enabling the servers within a cluster to shift work in order to balance the load. By enabling one server to take over for another, a cluster may be used to enhance stability and minimize downtime caused by an application or system failure.
Today, networked computer systems including clusters are used in many different aspects of our daily lives. They are used, for example, in business, government, education, entertainment, and communication. As networked computer systems and clusters become more prevalent and our reliance on them increases, it has become increasingly important to achieve the goal of always-on computer networks, or “high-availability” systems.
High-availability systems need to detect and recover from a failure in a way that is transparent to their users. For example, if a server in a high-availability system fails, the system must detect and recover from the failure with little or no impact on clients.
Various methods have been devised to achieve high availability in networked computer systems including clusters. For example, one method known as triple modular redundancy, or “TMR,” is used to increase fault tolerance at the hardware level. Specifically, with TMR, three instances of the same hardware module execute concurrently. By comparing the results of the three hardware modules and using the majority result, one can detect a failure of any one of the hardware modules. However, TMR does not detect or recover from a failure of software modules. Another method for achieving high availability is software replication, in which a software module that provides a service to a client is replicated on at least two different nodes in the system. While software replication overcomes some disadvantages of TMR, it suffers from its own problems, including the need for complex software protocols to ensure that all of the replicas have the same state.
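The majority comparison used by TMR can be illustrated with a minimal sketch. It models the three module results as plain Python values; the function name `majority_vote` and its return shape are assumptions made for illustration and are not taken from the patent.

```python
from collections import Counter


def majority_vote(outputs):
    """Return (majority_value, faulty_indices) for exactly three module outputs.

    Hypothetical helper: real TMR voting is done in hardware; this only
    shows the comparison logic on hashable Python values.
    """
    if len(outputs) != 3:
        raise ValueError("TMR compares exactly three module outputs")
    value, count = Counter(outputs).most_common(1)[0]
    if count < 2:
        # All three results differ, so no majority exists.
        raise RuntimeError("no majority among the three module outputs")
    # Any module whose result differs from the majority is flagged as failed.
    faulty = [i for i, out in enumerate(outputs) if out != value]
    return value, faulty


result, failed_modules = majority_vote([7, 7, 9])
```

Here the third module disagrees with the other two, so the majority result is used and that module is reported as failed. The sketch also shows the limitation the text points out: the vote compares results only and knows nothing about the state of software modules.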
The use of replication of hardware or software modules to achieve high-availability raises a number of new problems including management of replicated hardware and software modules. The management of replicas has become increasingly difficult and complex, especially if replication is done at the individual software and hardware level. Further, replication places a significant burden on system resources.
When replication is used to achieve high availability, one needs to manage redundant components and have an ability to assign work from failing components to healthy ones. However, telling a primary component to restart or a secondary component to take over is not sufficient to ensure continuity of services. To achieve a seamless fail-over, the successor needs to pick up where the failing component left off. This means that secondary components need to know what the last stable state of the primary component was.
One way of passing information regarding the state of the primary component is to use checkpoints. A checkpoint may be a file containing information that describes the state of the primary component at a particular time. Because checkpoints play a crucial role in achieving high-availability, there is a need for a system and method for providing reliable and efficient cluster replicated checkpoint services to achieve high availability.
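As a rough illustration of how a checkpoint lets a successor pick up where a failing component left off, consider the sketch below. The `CounterComponent` class, its methods, and the JSON encoding are all hypothetical choices made for this example; the patent does not prescribe a checkpoint format.

```python
import json
import time


class CounterComponent:
    """Hypothetical component whose entire state is one running total."""

    def __init__(self):
        self.total = 0

    def handle(self, amount):
        self.total += amount

    def take_checkpoint(self):
        # Describe the component's state at a particular time.
        return json.dumps({"taken_at": time.time(), "total": self.total})

    @classmethod
    def from_checkpoint(cls, checkpoint):
        # Rebuild a component from the last recorded stable state.
        component = cls()
        component.total = json.loads(checkpoint)["total"]
        return component


primary = CounterComponent()
primary.handle(5)
primary.handle(10)
checkpoint = primary.take_checkpoint()  # last stable state of the primary
primary.handle(3)                       # work done after the checkpoint

# Simulated fail-over: the secondary resumes from the checkpoint.
secondary = CounterComponent.from_checkpoint(checkpoint)
```

The secondary resumes with the total recorded in the checkpoint; the update made after the checkpoint was taken is not recovered, which is why how often and how reliably checkpoints are taken matters.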
SUMMARY OF THE INVENTION
The present invention provides a system and method for providing cluster replicated checkpoint services. In particular, the present invention provides a cluster replicated checkpoint service for managing a checkpoint and its replicas to make a cluster highly available.
To achieve these and other advantages and in accordance with the purposes of the present invention, as embodied and broadly described herein, the present invention describes a method for providing cluster replicated checkpoint services for replicas of a checkpoint in a cluster. The cluster includes a first node and a second node, which are connected to one another via a network. The replicas include a primary replica and a secondary replica. The method includes managing the checkpoint that contains checkpoint information, and creating the primary replica in a memory of the first node. The primary replica contains first checkpoint information. The method also includes updating the primary replica so that the first checkpoint information corresponds to the checkpoint information, creating the secondary replica that contains second checkpoint information in a memory of the second node, and updating the secondary replica so that the second checkpoint information corresponds to the checkpoint information.
In another aspect, the invention includes a method for providing cluster replicated checkpoint services for replicas of a checkpoint in a cluster. The cluster includes a first node and a second node, which are connected to one another via a network. The replicas include a primary replica and a secondary replica. The method includes creating the checkpoint, opening the checkpoint from the first node in a write mode, and creating the primary replica in a memory of the first node. It also includes updating the checkpoint, updating the primary replica, and propagating a checkpoint message that includes information regarding the checkpoint. Further, the method includes opening the checkpoint from the second node in a read mode, creating the secondary replica in a memory of the second node, and updating the secondary replica based on the checkpoint message.
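The sequence of steps in this aspect can be sketched in a single process as follows. This is a toy model under stated assumptions, not the patented implementation: the classes `Node`, `CheckpointService`, and `Handle`, the `"w"`/`"r"` mode strings, the dictionary-shaped checkpoint message, and the replay of earlier messages to a newly opened reader are all hypothetical.

```python
class Node:
    """Hypothetical cluster node; `replicas` stands in for the node's memory."""

    def __init__(self, name):
        self.name = name
        self.replicas = {}  # checkpoint name -> replica contents


class Handle:
    """Hypothetical handle returned when a node opens a checkpoint."""

    def __init__(self, service, name, node, mode):
        self.service, self.name, self.node, self.mode = service, name, node, mode

    def write(self, **changes):
        if self.mode != "w":
            raise PermissionError("checkpoint was not opened in write mode")
        self.service.checkpoints[self.name].update(changes)  # update checkpoint
        self.node.replicas[self.name].update(changes)        # update primary replica
        self.service.propagate({"checkpoint": self.name, "changes": changes})


class CheckpointService:
    """Toy in-process stand-in for a cluster replicated checkpoint service."""

    def __init__(self):
        self.checkpoints = {}  # name -> checkpoint information
        self.readers = {}      # name -> nodes holding a secondary replica
        self.log = {}          # name -> checkpoint messages sent so far

    def create(self, name):
        self.checkpoints[name] = {}
        self.readers[name] = []
        self.log[name] = []

    def open(self, name, node, mode):
        if mode == "w":
            # Primary replica lives in the writer node's memory.
            node.replicas[name] = dict(self.checkpoints[name])
        elif mode == "r":
            # Secondary replica starts empty and is built from messages
            # (replaying earlier messages is an assumption of this sketch).
            node.replicas[name] = {}
            for message in self.log[name]:
                node.replicas[name].update(message["changes"])
            self.readers[name].append(node)
        else:
            raise ValueError("mode must be 'w' or 'r'")
        return Handle(self, name, node, mode)

    def propagate(self, message):
        self.log[message["checkpoint"]].append(message)
        for node in self.readers[message["checkpoint"]]:
            node.replicas[message["checkpoint"]].update(message["changes"])


service = CheckpointService()
node1, node2 = Node("node1"), Node("node2")
service.create("session")
writer = service.open("session", node1, "w")
writer.write(user="alice", step=1)
reader = service.open("session", node2, "r")
writer.write(step=2)
```

After the second write, the replica on `node2` matches the primary replica on `node1` because both were updated from the same checkpoint messages. A real cluster would carry those messages over the network and would have to handle lost messages and node failures, which this sketch ignores.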
In yet another aspect, the invention includes a computer program product configured to provide cluster replicated checkpoint services for replicas of a checkpoint in a cluster. The cluster includes a first node and a second node, which are connected to one another via a network. The replicas include a primary replica and a secondary replica. The computer program product includes computer readable program codes configured to: (1) manage the checkpoint that contains checkpoint information; (2) create the primary replica with first checkpoint information in a memory of the first node; (3) update the primary replica so that the first checkpoint information corresponds to the checkpoint information; (4) create the secondary replica with second checkpoint information in a memory of the second node; and (5) update the secondary replica so that the second checkpoint information corresponds to the checkpoint information. The computer program product also includes a computer readable medium in which the computer readable program codes are embodied.
In a further aspect, the invention includes a computer program product configured to provide cluster replicated checkpoint services for replicas of a checkpoint in a cluster. The cluster includes a first node and a second node, which are connected to one another via a network. The replicas include a primary replica and a secondary replica. The computer program product includes computer readable program codes configured to: (1) create the checkpoint; (2) open the checkpoint from
Brossier Stephane
Herrmann Frederic
Kampe Mark A.
Beausoliel Robert
Duncan Marc M
Hogan & Hartson LLP
Kubida William J.
Lembke Kent A.