Host specific monitor script for networked computer clusters
Reexamination Certificate
2000-09-05
2004-03-02
Beausoliel, Robert (Department: 2184)
Error detection/correction and fault detection/recovery
Data processing system error or fault handling
Reliability and availability
C714S038110, C709S223000, C709S224000, C717S124000, C717S127000
Reexamination Certificate
active
06701463
ABSTRACT:
FIELD OF THE INVENTION
This invention generally relates to embedded software packages in distributed computer systems and more particularly to an improved system and method for system recovery and migration of services in the event of a failure.
BACKGROUND OF THE INVENTION
Distributed computer systems store enormous amounts of information that can be accessed by users for identification and retrieval of valuable documents that contain control, data, text, audio and video information. A typical example of a distributed system (100) is shown in FIG. 1. A distributed computer system consists of computer nodes (104a to 104n and 108a to 108z) and a communication network (102) that allows the exchange of messages between computer nodes. The communication network (102) may be any of the following: a local area network (LAN), a wide area network (WAN), a corporate intranet, the Internet, a wireless network, a cabling network or equivalents. Multiple storage devices (106a to 106n and 110a to 110z) store data files for the multiple nodes of the distributed system. Storage devices (106a to 106n) are local storage for the nodes (104a to 104n); storage devices (110a to 110z) are global databases which are accessible by nodes (108a to 108z); these are considered to belong to a storage or disk “farm” (112) of shared non-volatile memory. These nodes work together to achieve a common goal (e.g., a parallel scientific computation, a distributed database, control of multiple robots in a manufacturing plant or a parallel file system). In particular, nodes (108a to 108z) act as control servers in a manufacturing plant; they control and communicate with local nodes (104a to 104n) in order to effect control of an industrial process or device. Such devices, which are not shown in FIG. 1, include robots, printers, video devices, audio devices, tape devices, storage devices or their equivalents.
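For concreteness, the topology of FIG. 1 can be pictured as a small data model. The sketch below is purely illustrative; the class and field names are assumptions introduced here, not elements of the patent.

from dataclasses import dataclass, field
from typing import List

@dataclass
class StorageDevice:
    name: str              # e.g. "106a" (local) or "110a" (part of disk farm 112)
    shared: bool = False   # True for the shared disk-farm devices

@dataclass
class Node:
    name: str                        # e.g. "104a" or "108a"
    is_control_server: bool = False  # nodes 108a..108z act as control servers
    storage: List[StorageDevice] = field(default_factory=list)

# Local nodes 104a..104n, each with private storage 106a..106n
local_nodes = [Node(f"104{c}", storage=[StorageDevice(f"106{c}")]) for c in "abc"]

# Control servers 108a..108z sharing the global disk farm (112) of devices 110a..110z
disk_farm = [StorageDevice(f"110{c}", shared=True) for c in "abc"]
control_servers = [Node(f"108{c}", is_control_server=True, storage=disk_farm) for c in "abc"]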
FIG. 2 (200) illustrates the composition of a processing server node (202) utilized by some distributed processing node implementations. As shown, the node (202) contains a software package (204a) that effects control and communication with the devices mentioned above. Further, package (204a) comprises other services (208a) and an embedded monitor software subroutine (206a). Other services (208a) are the part of the package (204a) that performs device communication and control. Monitor subroutine (206a) monitors the functionalities of the package (204a). The package (204b), as shown in an expanded diagram, includes other services (208b) that typically execute:

Mount /home1
Export (share) /home1
Service File I/O Requests for /home1
Start Monitor

and a monitor subroutine (206b) that typically executes monitoring functions (see the sketch after this list) including:

Periodically Verifying I/O Daemon Responsiveness
If necessary, Restarting Daemons and Re-Exporting /home1
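The two lists above amount to a simple watchdog loop. A minimal sketch of such a loop follows; the daemon names, the check interval, and the restart and re-export commands are illustrative assumptions, not details taken from the patent.

import subprocess
import time

DAEMONS = ["nfsd", "rpc.mountd", "rpc.statd"]  # assumed I/O daemons to watch
CHECK_INTERVAL = 30                            # seconds between checks (assumed)

def daemon_alive(name: str) -> bool:
    """Return True if a process with this exact name is running (via pgrep)."""
    return subprocess.run(["pgrep", "-x", name], capture_output=True).returncode == 0

def restart_and_reexport(name: str) -> None:
    """Restart a dead daemon and re-export the shared file system (illustrative commands)."""
    subprocess.run(["/usr/sbin/" + name])   # restart mechanics are system-specific
    subprocess.run(["exportfs", "-a"])      # re-export /home1 and other shares

def monitor_loop() -> None:
    while True:
        for daemon in DAEMONS:
            if not daemon_alive(daemon):
                restart_and_reexport(daemon)
        time.sleep(CHECK_INTERVAL)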
However, a problem arises: because this embedded monitor subroutine (206a) is embedded in the package it is monitoring, it has no knowledge of other packages. So, if there is a package running on each node of a multiple-node cluster and one node fails, its package must move to another node. If both of these packages contained embedded monitors that were monitoring their respective packages, and a problem occurred that required corrective action, they would compete against each other by trying to restart resources. For example, whichever monitor starts a recovery process first would attempt to restart some process. This process restart is in turn detected by the second monitor as a failure, since the state of the first package and its monitor is unknown to the second monitor. Thus, the second monitor would now attempt to restart its process, and the errors would accrue successively.
Typical processing-node cluster software that uses this implementation is Hewlett Packard's (Palo Alto, Calif.) MC/ServiceGuard. In this software, the whole purpose of the packages is to form a collection of services that can move from one host or machine to another. This migration of services can be precipitated by a total nodal failure (i.e., an equipment failure such as a node catching fire), by planned maintenance on one of the nodes, or for the purpose of load balancing. The services contained within the nodes are grouped into packages as previously described; a given package is any combination of programs or data. Although service migration occurs for some failures, not all failures actually necessitate a migration of services; rather, a program that has died may, for example, be restarted by an automated watchdog process. A package monitoring program automatically performs this watchdog function in the Hewlett Packard implementation; there, the monitor is launched by the package it intends to monitor.
In addition, the cluster software is controlled by an operating system containing a network file system (nfs, SUN Microsystems, Palo Alto, Calif.). This network file system comprises a plurality of processes including, but not limited to: a) nfsd, the “nfs daemon”; b) rpc.mountd, the “remote procedure call mount daemon”; and c) rpc.statd, the “remote procedure call status daemon”. These are all part of the operating system that allows “nfs” (SUN's network file system) to work. These processes are the ones that are monitored for a file-sharing package. However, it is understood that the monitored processes may be anything that is required for a given package to perform its functionality.
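As one illustration of how such processes might be probed for responsiveness (rather than mere existence), the sketch below issues RPC null calls with rpcinfo; the mapping of daemons to RPC program names (nfs, mountd, status) is the conventional one and is an assumption here, not something specified by the patent.

import subprocess

RPC_PROGRAMS = {
    "nfsd": "nfs",           # nfs daemon
    "rpc.mountd": "mountd",  # remote procedure call mount daemon
    "rpc.statd": "status",   # remote procedure call status daemon
}

def rpc_responsive(program: str, host: str = "localhost") -> bool:
    """Issue an RPC null call via rpcinfo and report whether the service answered."""
    result = subprocess.run(["rpcinfo", "-u", host, program], capture_output=True)
    return result.returncode == 0

if __name__ == "__main__":
    for process, program in RPC_PROGRAMS.items():
        state = "responsive" if rpc_responsive(program) else "NOT responding"
        print(f"{process}: {state}")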
The above architecture presents two major problems: a) the monitor cannot be terminated without stopping the package it is monitoring, and b) only one package with similar attributes can be running at any given time on a node. First, if any adjustments are required in the monitor itself (timing issues, retries, or the equivalent), the client services provided by the package running the monitor must be interrupted. Because the goal of MC/ServiceGuard is to provide high availability for server resources, stopping a package even for a short period of time is undesirable. Second, if two or more packages with similar attributes were running, each could affect processes that the other is watching. As a result, an endless loop of erroneous attempts at corrective action prevents one server from taking over the resources of another server. Prior attempts to resolve this problem include maintaining a normally idle standby server or dedicating a functionality to a specific server. However, neither of these choices is cost effective, and neither permits distributed dissemination of packages. What is needed is a hardware or software implementation that solves these problems: a) the monitor cannot be updated because it cannot be terminated without stopping the package it is monitoring, and b) only one package with similar attributes can run at any given time on a node.
REFERENCES:
patent: 5793977 (1998-08-01), Schmidt
patent: 5983316 (1999-11-01), Norwood
patent: 6088727 (2000-07-01), Hosokawa et al.
patent: 6393485 (2002-05-01), Chao et al.
patent: 6421737 (2002-07-01), Stone et al.
Beausoliel Robert
Garrett Scott M.
Motorola Inc.
Puente Emerson