System and method for monitoring the state and operability...

Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability

Reexamination Certificate

active

06721907

ABSTRACT:

TECHNICAL FIELD
The present invention relates generally to computing systems, and more particularly to a system and method for monitoring the state and operability of components in distributed computing systems. The present invention indicates whether a component is operating correctly, and reliably distributes the state of all components among the elements of the system.
BACKGROUND OF THE INVENTION
In any distributed computing system, it is desirable to monitor the state of the various components (e.g., to know which components are operating correctly and to detect which ones are not operable). It is further desirable to distribute the state of all components among the elements of the system.
In the known prior art, “heartbeats,” sometimes referred to as “I'm alive” packets, are used to distribute the state of all components. In particular, these packets are employed in computing systems that use a point-to-point messaging mechanism, and in cluster membership services that use a ring topology in which messages carrying a list of current members are sent from one machine to the next in a chain. However, in all of these prior implementations, each machine sends a packet to every other machine, thereby requiring an order N² algorithm to distribute state information. To reduce the number of messages from order N² to order N, the present invention uses a reliable multicast protocol to distribute state information.
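The difference in message counts can be illustrated with a short calculation. The sketch below assumes one heartbeat round in which every machine reports its state once; the function names are hypothetical and do not come from the disclosure.

```python
def point_to_point_messages(n: int) -> int:
    """Messages per round when each of n machines sends to every other machine."""
    return n * (n - 1)


def multicast_messages(n: int) -> int:
    """Messages per round when each of n machines sends a single multicast packet."""
    return n


# Compare the two schemes for a few system sizes.
for n in (4, 16, 64):
    print(n, point_to_point_messages(n), multicast_messages(n))
```

For 64 machines, the all-pairs scheme sends 4032 packets per round, while the multicast scheme sends 64.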
SUMMARY OF THE INVENTION
According to the disclosed embodiments, a method and system are provided for determining whether a given component in a distributed computing system is operating correctly, and for reliably distributing the state of the components among all the elements of the system.
One non-limiting advantage of the present invention is that it provides an update service that allows local processes to record, retrieve and distribute state information via table entries in a relational table.
Another non-limiting advantage of the present invention is that it provides an update service that allows processes on a given machine to communicate with a local agent of the update service using a reliable protocol.
Another non-limiting advantage of the present invention is that it provides an update service including a Life Support Service (LSS) process that stores information in separate relational tables for the various types of processes within a distributed computing system.
Another non-limiting advantage of the present invention is that it provides an update service that gives the LSS process read-write access to the relational tables while giving the local processes read-only access, so that the local processes may perform lookups or rescans of the local relational tables.
Another non-limiting advantage of the present invention is that it provides an update service that allows multiple processes on a given machine to perform lookups into the same or different relational tables in parallel without contention and without communication with a server by using a non-blocking coherency algorithm.
Another non-limiting advantage of the present invention is that it provides an update service that allows a specific local process to perform a rescan using a batch processing mechanism when notified of a large number of updates.
Another non-limiting advantage of the present invention is that it provides an update service that allows local updates to be propagated to all other LSS processes in the system.
Another non-limiting advantage of the present invention is that it provides a “heartbeat” service that promptly delivers failure notifications.
Another non-limiting advantage of the present invention is that it provides update and heartbeat services that are “lightweight” and greatly simplified as a result of using a reliable protocol.
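The text above does not specify the non-blocking coherency algorithm. As one possible illustration only, the sketch below uses a version counter in the style of a seqlock: a single writer marks the table as changing before it writes, and readers retry instead of taking a lock. The class and method names are hypothetical, and a real implementation would likely keep the table in shared memory with appropriate memory barriers rather than in a Python object.

```python
class VersionedTable:
    """Single-writer table guarded by a version counter (seqlock-style sketch)."""

    def __init__(self):
        self.version = 0  # even: table is stable, odd: a write is in progress
        self.rows = {}

    def write(self, key, value):
        """Called only by the single writer (the role the LSS process plays)."""
        self.version += 1  # now odd: readers know a write is in progress
        self.rows[key] = value
        self.version += 1  # even again: table is stable

    def lookup(self, key, max_retries=100):
        """Read without taking a lock; retry if a write overlapped the read."""
        for _ in range(max_retries):
            before = self.version
            if before % 2 == 1:
                continue  # writer is mid-update, try again
            value = self.rows.get(key)
            if self.version == before:
                return value  # no write happened during the read
        raise RuntimeError("table did not stabilize")


table = VersionedTable()
table.write("gateway-1", "up")
print(table.lookup("gateway-1"))
```

Because readers never modify the table or hold a lock, any number of them can look up entries in parallel; only a read that overlaps a write is repeated.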
According to one aspect of the present invention, a system is provided for monitoring state information in a distributed computing system, including a plurality of nodes which are coupled together by at least one switching fabric. The system includes an update service including a plurality of local applications, each of the local applications respectively residing on a unique one of the plurality of nodes and being adapted to record and update state information from local clients in a local relational table, and a system-wide application which is adapted to propagate the updated state information across the distributed computing system to a plurality of the local relational tables. The system may also include a heartbeat service which is adapted to selectively generate and receive messages throughout the system to indicate whether the components of the system are operating normally.
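A heartbeat service of this kind is commonly built around a timeout: a component is treated as failed when no heartbeat has arrived from it within a set interval. The following is a minimal sketch under that assumption; the class name, the timeout value, and the use of explicit timestamps are illustrative choices, not details taken from the disclosure.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat time per node and reports nodes that went silent."""

    def __init__(self, timeout: float):
        self.timeout = timeout  # seconds of silence before a node is considered failed
        self.last_seen = {}     # node name -> time of most recent heartbeat

    def record(self, node: str, now: float) -> None:
        """Note that a heartbeat from `node` arrived at time `now`."""
        self.last_seen[node] = now

    def failed_nodes(self, now: float) -> list:
        """Return, in sorted order, the nodes silent for longer than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


monitor = HeartbeatMonitor(timeout=3.0)
monitor.record("node-a", now=0.0)
monitor.record("node-b", now=0.0)
monitor.record("node-a", now=2.0)     # node-a keeps sending, node-b goes silent
print(monitor.failed_nodes(now=4.0))  # → ['node-b']
```

Timestamps are passed in explicitly so the behavior is deterministic; a deployed monitor would read a monotonic clock instead.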
According to a second aspect of the invention, a distributed file system is provided. The distributed file system includes at least one switching fabric; a plurality of nodes which provide at least one file system service process, and which are communicatively coupled together by the at least one switching fabric; a plurality of local update service applications that respectively reside upon the plurality of nodes and which update state information from local clients on the plurality of nodes in a plurality of local relational tables; and a system-wide update service application which communicates updated state information across the distributed file system to a plurality of local relational tables.
According to a third aspect of the invention, a method of monitoring the state of components in a distributed computing system is provided. The distributed computing system includes a plurality of interconnected service nodes, each including at least one local client. The method includes the steps of: monitoring the state of the local clients on each service node; updating information relating to the state of the local clients in a plurality of local relational tables respectively residing on the plurality of service nodes; and communicating the updated state information to the local relational tables on the service nodes over a multicast channel.
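The three steps of the method (monitor, update locally, communicate over a multicast channel) can be sketched with an in-memory simulation. Everything here is a stand-in under stated assumptions: `MulticastChannel` is a hypothetical class that simply loops over its members and does not model packet loss, ordering, or retransmission, and a dictionary takes the place of each local relational table.

```python
class MulticastChannel:
    """In-memory stand-in for a multicast group: one send reaches every member."""

    def __init__(self):
        self.members = []

    def join(self, node):
        self.members.append(node)

    def send(self, update):
        for node in self.members:
            node.receive(update)


class ServiceNode:
    """A node holding a local table of (node, client) -> state entries."""

    def __init__(self, name, channel):
        self.name = name
        self.table = {}
        self.channel = channel
        channel.join(self)

    def report(self, client, state):
        """Record a local client's state, then publish it to all nodes."""
        self.table[(self.name, client)] = state         # update the local table
        self.channel.send((self.name, client, state))   # one send per update

    def receive(self, update):
        """Apply an update from the channel; reapplying the same update is harmless."""
        node, client, state = update
        self.table[(node, client)] = state


channel = MulticastChannel()
nodes = [ServiceNode(name, channel) for name in ("n1", "n2", "n3")]
nodes[0].report("file-service", "up")
nodes[2].report("gateway", "down")
print(nodes[1].table)
```

After the two reports, every node's table holds the same two entries, although each update was sent only once by the node that observed it.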


