Technique for referencing failure information representative...

Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability

Reexamination Certificate

Details

active

06651183

ABSTRACT:

TECHNICAL FIELD
The present invention relates in general to distributed computing environments having a plurality of processing nodes, and more particularly, to a technique for referencing failure information representative of multiple related failure conditions occurring within the distributed computing environment at the same or different nodes of the plurality of nodes of the environment.
BACKGROUND OF THE INVENTION
A distributed system is often difficult to manage due to complicated and dynamic component interdependencies. Managers are used in a distributed system and are responsible for obtaining information about the activities and current state of components within the system, making decisions according to an overall management policy, and performing control actions to change the behavior of the components. Generally, managers perform five functions within a distributed system, namely configuration, performance, accounting, security, and fault management.
None of these five functions is particularly suited for diagnosing faults occurring in complex distributed systems. Diagnosing faults using manual management is time consuming and requires intimate knowledge of the distributed system. Also, it is difficult to isolate faults in a distributed environment because a resource limitation on one system may cause a performance degradation on another system, which is not apparent unless one is very familiar with the architecture of the distributed application and how the components work together.
In distributed computing environments, many software components are exploited in an interdependent fashion to provide function to the end-user. End-users are often not aware of the interdependencies of the various components; they only know that the environment provides some expected function. The components may be distributed amongst the various compute nodes of the distributed computing environment. In cases where a component experiences a failure, this failure can ripple throughout the distributed computing environment, causing further failures on those components that rely upon the failed component for a specific function. This ripple effect continues, with components affecting the function of those components that rely upon them, until ultimately the end-user is denied the expected function.
The challenge in this environment is to trace the failure condition from its symptom (in this case, the denial of the expected function) to as close to the root cause of the problem (in this case, the original failed component) as possible in an acceptable period of time. Complicating this effort is the fact that multiple failure conditions may exist in the distributed computing environment at the same time. To properly identify the root cause, the failure conditions related to the failure symptom in question must be identified, and information pertaining to those failure conditions must be collected. Unrelated failure conditions should be eliminated from the analysis, since repair of these conditions would not lead to a repair of the failure symptom in question. Identifying these related failures has heretofore required an intimate knowledge of the distributed computing environment, its implementation, and the interdependencies of its components. Even with this level of knowledge, problem determination efforts are non-deterministic efforts, based on the “best guess” of the problem investigator as to where the root cause of the failure condition in question may reside. The larger and more complex the distributed computing environment, the more components introduced into the environment, the more difficult it becomes to reliably “guess” where the source of the failure may reside. The knowledge necessary to undertake the problem determination effort resides only with the distributed computing environment manufacturer, making it difficult for distributed computing environment administrators to effectively identify and resolve failures.
DISCLOSURE OF THE INVENTION
Briefly summarized, the present invention comprises in one aspect a method for referencing failure information in a distributed computing environment having a plurality of nodes. The method includes: creating a failure report by recording information on a failure condition upon detection of the failure condition at a node of the distributed computing environment; and assigning an identifier to the failure report and storing the failure report at the node, wherein the identifier uniquely identifies the failure report including the node within the distributed computing environment creating the failure report, and where within storage associated with the node the failure report is located.
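The first aspect can be illustrated with a minimal Python sketch. It assumes a hypothetical identifier made of a node name plus a slot index into that node's storage; the names `FailureId`, `NodeStore`, and `record_failure` are illustrative and do not come from the patent, which does not specify a concrete encoding.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class FailureId:
    """Hypothetical identifier: names the creating node and the storage slot."""
    node: str  # node that created and stores the report
    slot: int  # position of the report within that node's storage

    def token(self) -> str:
        # Assumed textual form, chosen only for this illustration.
        return f"{self.node}:{self.slot}"


@dataclass
class NodeStore:
    """Illustrative stand-in for the storage associated with one node."""
    node: str
    reports: List[Dict[str, str]] = field(default_factory=list)

    def record_failure(self, condition: str) -> FailureId:
        # Create the failure report by recording information on the condition.
        self.reports.append({"condition": condition})
        # The identifier encodes both the node and the location in its storage.
        return FailureId(self.node, len(self.reports) - 1)


store = NodeStore("node-a")
fid = store.record_failure("disk adapter timeout")
print(fid.token())
```

Because the identifier carries both the node and the storage location, any holder of it can locate the report without searching other nodes.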
In another aspect, the present invention comprises a method for referencing failure information in a distributed computing environment having a plurality of nodes. This method includes: creating a first program failure report upon detection of a first program failure condition at a first node; assigning a first identifier to the first program failure report which uniquely identifies the first program failure report including the node within the distributed computing environment creating the first program failure report and where within storage associated with that node the first program failure report is located; creating a second program failure report upon detecting a second program failure condition at a second node which is related to the first program failure condition, wherein the second program failure report is created by recording information on the second program failure condition at the second node, and wherein the second node and the first node may comprise the same node or different nodes within the distributed computing environment; and assigning a second identifier to the second program failure report which uniquely identifies the second program failure report including the second node within the distributed computing environment creating the second program failure report, where within storage associated with the second node the second program failure report is located, and the first identifier for the first program failure report on the first program failure condition related to the second program failure condition.
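A rough sketch of this second aspect follows, assuming the identifier is a simple `(node, index)` tuple and that reports are held in per-node in-memory lists. The `Environment` class, the `record` method, and the `related` field are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Hypothetical identifier: (node name, index within that node's storage).
FailureId = Tuple[str, int]


@dataclass
class FailureReport:
    condition: str
    related: Optional[FailureId] = None  # identifier of the related earlier report


class Environment:
    """Illustrative per-node report storage for a small environment."""

    def __init__(self) -> None:
        self.storage: Dict[str, List[FailureReport]] = {}

    def record(self, node: str, condition: str,
               related: Optional[FailureId] = None) -> FailureId:
        # Store the report at the node that detected the failure condition.
        reports = self.storage.setdefault(node, [])
        reports.append(FailureReport(condition, related))
        # The identifier names the node and the location within its storage.
        return (node, len(reports) - 1)


env = Environment()
first = env.record("node-a", "file system unavailable")
# The second report carries the first identifier, linking the two conditions.
second = env.record("node-b", "database cannot open log", related=first)
```

The same call works when both failures occur on one node: passing `"node-a"` for the second report simply yields a different index in the same node's storage.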
Systems and at least one program storage device readable by a machine, tangibly embodying at least one program of instructions executable by the machine to perform the above-summarized methods for referencing failure information in a distributed computing environment are also described and claimed herein.
To restate, presented is a technique for referencing failure information within a distributed computing environment. Persistent storage is employed which is accessible to all components of the environment. Reports of failures detected by system components, recorded to the persistent storage, preferably describe the nature of the failure condition, possible causes of the condition, and recommended actions to take in response to the condition. An identifier token is assigned which uniquely identifies a specific failure report for the failure condition, including the node within the distributed computing environment where the record resides and the location within the persistent storage of that node where the record resides. Using this identifier, the failure report can be located from any location within the distributed computing environment and retrieved for use in problem determination and resolution analysis. This identifier is passed between related components of the environment as part of a component's response information. Should a component experience a failure due to another component's failure, the identifier is obtained from the first component's response information and included within the information recorded as part of the second component's failure report.
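Following the chain of identifiers from a symptom back toward its root cause might look like the sketch below. It assumes a hypothetical `(node, index)` identifier and an in-memory dictionary standing in for the persistent storage; the sample reports, `fetch`, and `trace` are invented for illustration and are not taken from the patent.

```python
from typing import Dict, List, Optional, Tuple

FailureId = Tuple[str, int]  # hypothetical: (node, index in that node's storage)

# Illustrative persistent storage, keyed by node; each report is a dict.
storage: Dict[str, List[dict]] = {
    "node-a": [{"condition": "disk adapter timeout", "related": None}],
    "node-b": [{"condition": "file system unavailable", "related": ("node-a", 0)}],
    "node-c": [{"condition": "application request denied", "related": ("node-b", 0)}],
}


def fetch(fid: FailureId) -> dict:
    """Locate a report directly from its identifier, with no searching."""
    node, index = fid
    return storage[node][index]


def trace(fid: Optional[FailureId]) -> List[str]:
    """Walk from a symptom's report back through its related reports."""
    chain: List[str] = []
    seen = set()  # guards against an accidental cycle in the sample data
    while fid is not None and fid not in seen:
        seen.add(fid)
        report = fetch(fid)
        chain.append(f"{fid[0]}: {report['condition']}")
        fid = report["related"]
    return chain


chain = trace(("node-c", 0))
for step in chain:
    print(step)
```

Starting from the report on `node-c`, the walk visits only the related reports, so unrelated failure conditions recorded elsewhere never enter the analysis.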
In accordance with the principles of the present invention, the previous need to guess where the distributed computing environment problem determination should begin to search for failure records is eliminated. The unique failure identifier p
