Method and apparatus for fail safe configuration

Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability

Reexamination Certificate

active

06173420

ABSTRACT:

FIELD OF THE INVENTION
The present invention relates to configuring a cluster, and more specifically, to a method and apparatus for configuring on a cluster a software application that is not necessarily designed for execution on a cluster.
BACKGROUND OF THE INVENTION
A computer network typically includes a set of devices connected in a way that allows the devices to communicate with each other. Such devices, which can include workstations with memory and one or more processors, are often referred to as nodes. A cluster is a group of nodes that work together as a single system. One software application that allows groups of nodes to operate as a single system is NT Enterprise, which is generally available from Microsoft Corporation.
Clusters can be either “shared data” or “shared nothing” clusters. In a shared data cluster, all nodes have access to one or more shared storage devices. In a shared nothing cluster, storage devices are “owned” by nodes, and nodes only have access to the storage devices that they own.
In general, clustering technology is designed to minimize downtime for client/server network computing applications. Downtime may be minimized, for example, by shifting the responsibilities of a first node in the cluster to a second node in the cluster if the first node in the cluster fails. Shifting responsibilities in this manner is referred to as fail over. A node that assumes the responsibilities of another node in response to a fail over is referred to herein as a fail over node.
The responsibilities that a node is able to handle are determined in part by the software that is executing on the node. For example, a node may be able to process database requests because it is executing a database server. If the node fails, the responsibility for processing database requests can only be shifted to a fail over node that is able to execute the database server. Since the fail over node is not currently executing the database server, the database server must be started on the fail over node in response to the fail over. Techniques for performing automatic fail over in a client/server system are described in U.S. patent application Ser. No. 08/866,842 entitled “Automatic Failover for Clients Accessing a Resource Through a Server”, filed on May 30, 1997, the contents of which are incorporated herein by reference.
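The fail over behavior described above can be illustrated with a minimal Python sketch. It is not taken from the referenced application; the `Node` class, the `fail_over` function, and the node and service names are hypothetical and exist only to show that responsibility can shift solely to a node on which the software has been configured, and that the software must then be started there.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical model of a cluster node (illustration only)."""
    name: str
    installed: set = field(default_factory=set)  # software configured on this node
    running: set = field(default_factory=set)    # software currently executing
    alive: bool = True


def fail_over(failed, candidates, service):
    """Shift `service` from a failed node to the first candidate able to run it."""
    failed.alive = False
    failed.running.discard(service)
    for node in candidates:
        # Only a live node on which the service has been configured can take over.
        if node.alive and service in node.installed:
            node.running.add(service)  # the service is started on the fail over node
            return node
    raise RuntimeError(f"no fail over node can run {service!r}")


a = Node("node-a", installed={"db"}, running={"db"})
b = Node("node-b")                    # database server not configured here
c = Node("node-c", installed={"db"})  # configured, but not running it yet

target = fail_over(a, [b, c], "db")
print(target.name)
```

In this sketch node-b is skipped because the database server was never configured on it, which is the situation that motivates configuring software on every potential fail over node in advance.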
Many software programs must be specifically configured for a node before they can be safely executed on the node. Configuring a software program may involve, for example, (1) configuring the network required to run the client/server based application, (2) configuring the application itself, and (3) configuring any other software that may be required for the application to run. The process of configuring a software program for a node can be complex and time consuming. It typically requires the user to manually perform a series of steps specified by the software provider. For sophisticated software programs, the steps can be both numerous and complex. Further, if one step in the configuration process fails, the entire configuration operation may have to be restarted.
Applications designed to run on a single node are generally referred to as stand alone applications. An application that runs in a cluster environment and is capable of fail over to another node in the cluster when the primary node fails is referred to as a fail safe application.
Before a stand alone application is configured for fail safe operation, the application can only run on one of the clustered nodes. This node is referred to as the owner node. Fail safe operation requires the application to be configured both on the owner node and on other nodes in the cluster so that the application can run on multiple nodes in the cluster to provide fail over capability.
In fail over systems, software programs must be configured on both (1) nodes that will initially execute the programs, and (2) nodes that may have to execute the programs if fail over occurs. Thus, depending on the fail over policies employed within a cluster, a given software program may have to be configured on all of the nodes in a cluster even though it is planned to be executed on only one of the nodes in the cluster at a time.
A configuration operation becomes exponentially more complex and time consuming as the number of nodes for which the program must be configured increases. Consequently, configuring applications for use on clusters that employ fail over can be prohibitively burdensome. For example, one software program has a forty-step configuration process. Configuring such a program on a relatively small cluster of nodes has taken an expert engineer approximately nineteen hours.
Based on the foregoing, it is clearly desirable to reduce the complexity of configuring software in clusters that employ fail over policies.
SUMMARY OF THE INVENTION
A method and apparatus are provided for automatically turning a stand alone application into a fail safe application with minimal expertise required of the user of the application. According to one aspect of the invention, a configuration coordinator executing on a configuration manager communicates with one or more configuration slaves executing on a set of nodes that are operating as a cluster. The configuration coordinator sends messages to the one or more configuration slaves to initiate a configuration operation for a software application. The configuration coordinator generates log information to track which configuration slaves have initiated and completed configuration operations.
Each configuration slave automatically performs a series of actions to configure the node on which it resides. While performing the series of actions, the configuration slaves generate logs that reflect their progress in performing the series of actions. If a problem occurs during performance of the series of actions, the configuration slave that encounters the problem indicates to the configuration coordinator that an error occurred. The configuration coordinator responds to the error by causing the configuration slaves to roll back changes made during performance of the series of actions. The configuration slaves that have begun but not completed the series of actions inspect their logs to determine which changes to roll back.
By automatically configuring software on a cluster, and automatically rolling back changes on all cluster nodes in the event of an error during the configuration process, the cluster configuration process is made atomic, automatic, and significantly faster and less error-prone than manual cluster-wide configuration operations.
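The coordinator and slave interaction summarized above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the claimed implementation: the class names, the representation of a step as a (name, do, undo) triple, the use of a set as a stand-in for a node's configuration, and the sequential in-process calls in place of network messages are all hypothetical.

```python
class ConfigurationSlave:
    """Runs configuration steps on one node and logs each completed step."""

    def __init__(self, node, state):
        self.node = node
        self.state = state  # stand-in for the node's configuration (assumption)
        self.log = []       # names of the steps completed so far

    def configure(self, steps):
        for name, do, _undo in steps:
            do(self.state)         # raises to signal a problem
            self.log.append(name)  # record progress only after the step succeeds

    def roll_back(self, steps):
        undo_by_name = {name: undo for name, _do, undo in steps}
        # Inspect the log and undo only what was done, most recent change first.
        while self.log:
            undo_by_name[self.log.pop()](self.state)


class ConfigurationCoordinator:
    """Initiates configuration on each slave and tracks its status in a log."""

    def __init__(self, slaves):
        self.slaves = slaves
        self.log = []  # (node, "initiated" | "completed" | "failed")

    def run(self, steps):
        for slave in self.slaves:
            self.log.append((slave.node, "initiated"))
            try:
                slave.configure(steps)
            except Exception:
                self.log.append((slave.node, "failed"))
                for s in self.slaves:  # roll back on every node
                    s.roll_back(steps)
                return False
            self.log.append((slave.node, "completed"))
        return True


def set_step(key):
    """Build a hypothetical step that adds `key` and can remove it again."""
    return (key, lambda s: s.add(key), lambda s: s.discard(key))


def app_do(state):
    if "broken" in state:  # simulated failure on a misbehaving node
        raise RuntimeError("application step failed")
    state.add("app")


steps = [set_step("network"), ("app", app_do, lambda s: s.discard("app"))]

good = ConfigurationSlave("node-1", set())
bad = ConfigurationSlave("node-2", {"broken"})
coordinator = ConfigurationCoordinator([good, bad])
ok = coordinator.run(steps)
print(ok, good.state, bad.state)
```

In this sketch the second node fails partway through, so the coordinator has every slave consult its own log and undo its changes, which leaves both nodes as they were before the operation began. That all-or-nothing outcome is what makes the cluster-wide configuration atomic.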


