Retry mechanism for remote operation failure in distributed...

Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability

Reexamination Certificate

Details

C370S403000

active

06286111

ABSTRACT:

BACKGROUND OF THE INVENTION
The present invention is directed to managing a large distributed computer enterprise environment. More particularly, it relates to retrying a failed operation in a distributed computing environment.
It is well known to couple computer systems together by means of a network such as a local area network (LAN) or wide area network (WAN) to obtain access to computing resources located on a remote computer system. It is generally not economically feasible to provide a printer and expensive DASD at each user workstation. By connecting the resources of the entire network together and making these selectively available to users, a much greater and more efficient collection of resources can be mustered than would be possible if all resources were to be provided at each desktop.
However, managing a computer network comprising hundreds or even thousands of nodes to provide such computing resources can produce serious difficulties for system administrators. Management tasks, such as distribution of system-wide changes, must be carried out quickly and in a dependable manner in order to reduce the probability of catastrophic failure. Typically, a system operation is initiated at a central location, e.g., an administrator's workstation, and invoked on one or more remote machines in the network. Preferably, system operations are invoked on a group or subnet of machines in a single operation. Yet distributed computing environments that are known in the art do not scale easily to large size.
There are many reasons why an operation invoked on a remote machine may fail, including network failure and command syntax that is incompatible with the remote machine. Another reason that the operation may fail is that the target machine is down, either because the user has turned the machine off or because some program, such as a power management program, has powered down the machine. When distributing software, it is important to be able to complete the distribution over the entire network as quickly as possible.
The present invention addresses and solves these problems.
SUMMARY OF THE INVENTION
The present invention provides a mechanism for retrying a system operation on a remote node in a distributed environment.
In an “optimistic” embodiment of the invention, a local system issues a set of commands over a network to a remote node to perform a system operation. Responsive to a failure by the remote node to perform the requested system action, the retry mechanism determines whether the remote node could be in a “node-down” or similar nonoperational state. If the remote node could be in the “node-down” state, the system issues a magic packet to the remote node. Next, the system waits a predetermined period of time for the remote node to be brought to a fully operational state. The system issues the set of commands a second time to the remote node to perform the system operation.
In a “preemptive” or “pessimistic” embodiment of the invention, the likelihood that the remote node is in a “node-down” or similar state is sufficiently high to outweigh the cost of sending a magic packet over the network. Thus, expecting failure of a request for a system operation on a remote node, a magic packet is issued preemptively to the remote node over a network to bring it to a fully operational state. Then, the local system issues a set of commands over the network to the remote node to perform the system operation.
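The two embodiments described above can be illustrated with a minimal Python sketch. It is not taken from the patent. It assumes the widely used Wake-on-LAN magic packet layout (six 0xFF bytes followed by sixteen copies of the target's 6-byte MAC address) sent as a UDP broadcast. The names build_magic_packet, send_magic_packet, NodeDownError, run_optimistic and run_preemptive are hypothetical, as are the default broadcast address and port. How a failure is classified as a possible "node-down" state is left to the caller, who signals it by raising NodeDownError.

```python
import socket
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class NodeDownError(Exception):
    """Hypothetical signal that a failure may be due to a powered-down node."""


def build_magic_packet(mac: str) -> bytes:
    """Build a Wake-on-LAN payload: 6 bytes of 0xFF, then the MAC repeated 16 times."""
    mac_bytes = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(mac_bytes) != 6:
        raise ValueError("MAC address must contain exactly 6 bytes")
    return b"\xff" * 6 + mac_bytes * 16


def send_magic_packet(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Broadcast the magic packet over UDP (address and port are assumed defaults)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(build_magic_packet(mac), (broadcast, port))


def run_optimistic(
    operation: Callable[[], T],
    wake: Callable[[], None],
    wait_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try first; on a possible node-down failure, wake the node, wait, retry once."""
    try:
        return operation()
    except NodeDownError:
        wake()               # issue the magic packet
        sleep(wait_seconds)  # predetermined wait for the node to come up
        return operation()   # issue the same commands a second time


def run_preemptive(
    operation: Callable[[], T],
    wake: Callable[[], None],
    wait_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Expect failure: wake the node and wait before the first attempt."""
    wake()
    sleep(wait_seconds)
    return operation()
```

A caller might bind the wake step to a specific machine, for example run_optimistic(deploy, lambda: send_magic_packet("00:11:22:33:44:55"), 60.0), where deploy is a hypothetical function that issues the commands and raises NodeDownError when the target does not respond. Failures that cannot be explained by a powered-down node, such as incompatible command syntax, propagate unchanged and are not retried in this sketch.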


REFERENCES:
patent: 5515508 (1996-05-01), Pettus et al.
patent: 5568402 (1996-10-01), Gray et al.
patent: 5583793 (1996-12-01), Gray et al.
patent: 5781908 (1998-07-01), Williams et al.
patent: 5802305 (1998-09-01), McKaughan et al.
patent: 5915119 (1999-06-01), Cone
patent: 5938771 (1999-08-01), Williams et al.
patent: 5987521 (1999-11-01), Arrowood et al.
patent: 6088729 (2000-07-01), McCrory et al.
patent: 6088770 (2000-07-01), Tarui et al.
patent: 6098100 (2000-08-01), Wey et al.
