Reexamination Certificate
2000-03-31
2003-07-22
Beausoliel, Robert (Department: 2184)
Error detection/correction and fault detection/recovery
Data processing system error or fault handling
Reliability and availability
C714S019000, C714S020000
active
06598179
ABSTRACT:
FIELD OF THE INVENTION
The present invention relates generally to diagnosis of fault conditions in computer systems, and specifically to fault diagnosis using error log analysis.
BACKGROUND OF THE INVENTION
Because of the increasing complexity of computers and computer-based systems, system administrators and maintenance personnel generally do not have sufficient knowledge and expertise to diagnose all of the faults that can occur in these systems. A variety of diagnostic tools have been developed in order to help in identifying the cause of such faults and determining the corrective action that must be taken. These tools generally receive and analyze error reports from different system components. In their most basic embodiments, the analysis is based on simple, pre-programmed “if-then” rules. More sophisticated tools have been developed that use techniques such as artificial intelligence, expert systems, neural networks and inference engines. Tools of this sort are described, for example, in U.S. Pat. Nos. 4,633,467, 4,964,125 and 5,214,653, whose disclosures are incorporated herein by reference.
In many computer systems, a system error log stores a record of all of the error reports that are received from system components. The error log is supposed to be used by the system administrator or maintenance engineer in tracing and understanding faults that have occurred. The number of errors in the log can be very large, however, and with the exception of a few patterns that the system administrator may recognize from experience, the error log generally provides no clue as to the source of the error or how to solve it. At best, an enterprising system administrator may be able to find faults that are relatively straightforward by looking up error codes from the error log in a system maintenance manual. In more complex cases, the system administrator may not even be able to determine whether the entries in the error log are due to a hardware fault or to a software problem.
U.S. Pat. No. 5,463,768, whose disclosure is incorporated herein by reference, describes a method and system for automatic error log analysis. A training unit receives historical error logs, generated during abnormal operation or failure of machines of a given type, together with the actual repair solutions that were applied to fix the machines in these circumstances. The training unit identifies and labels sections, or blocks, within the error logs that are common to multiple occurrences of a given fault. These blocks are assigned a weight indicative of their value in diagnosing the fault. A diagnostic unit receives new error logs associated with abnormal operation or failure of a similar machine, and compares the new error logs to the blocks identified by the training unit. The diagnostic unit uses similarities that it finds between blocks in the new error log and the identified historical blocks to determine a fault diagnosis and suggested solution. The solution receives a score, or similarity index, based on the weights of the blocks.
SUMMARY OF THE INVENTION
It is an object of the present invention to provide improved methods and apparatus for diagnosing faults in a computer system.
It is a further object of some aspects of the present invention to provide methods and apparatus that assist the operator of a computer system in understanding and repairing faults that occur in the system.
It is still a further object of some aspects of the present invention to provide improved methods and apparatus for analysis of an error log generated by a computer system.
In preferred embodiments of the present invention, an error log analyzer (ELA) scans error logs generated by a computer system. The logs are preferably generated whenever the system is running and are analyzed by the ELA at regular intervals and/or when a fault has occurred. The ELA typically comprises a software process running on a node of the computer system. Alternatively, the ELA may comprise dedicated computing hardware.
The ELA processes error log data in three stages:
A selection stage, in which the ELA determines, for each error in the log, whether the error is of relevance to fault conditions of interest. Relevant errors are held for further processing, while irrelevant errors are discarded.
A filtering stage, in which certain errors are composed, i.e., filtered and grouped together, into events, which are known to be associated with particular fault conditions.
An analysis stage, in which the events are checked in order to decide whether their numbers and types are such as to indicate that a fault exists that requires service attention. If so, the problem and, preferably, suggested solutions are reported to a system operator.
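The three stages listed above can be sketched roughly as follows. This is only an illustrative outline, not the patented implementation: the error kinds, event names, thresholds and function names (`select`, `compose_events`, `analyze`, `run_ela`) are all hypothetical, and the grouping rule used here (a fixed number of occurrences of one kind of error) is a simplifying assumption.

```python
from collections import Counter

# Hypothetical error log: (timestamp_seconds, error_kind) pairs.
ERROR_LOG = [
    (0, "DISK_TIMEOUT"),
    (5, "FAN_NOTICE"),
    (12, "DISK_TIMEOUT"),
    (20, "DISK_TIMEOUT"),
    (300, "NET_RETRY"),
]

# Hypothetical decision tables, one per stage.
RELEVANT_KINDS = {"DISK_TIMEOUT", "NET_RETRY"}        # selection stage
EVENT_RULES = {"DISK_TIMEOUT": ("DISK_DEGRADED", 3)}  # kind -> (event, errors needed)
FAULT_THRESHOLDS = {"DISK_DEGRADED": 1}               # event -> events needed to report


def select(log):
    """Selection stage: keep relevant errors, discard the rest."""
    return [entry for entry in log if entry[1] in RELEVANT_KINDS]


def compose_events(errors):
    """Filtering stage: group selected errors into events."""
    counts = Counter(kind for _, kind in errors)
    events = []
    for kind, (event_name, needed) in EVENT_RULES.items():
        events.extend([event_name] * (counts[kind] // needed))
    return events


def analyze(events):
    """Analysis stage: report faults whose event count reaches the threshold."""
    counts = Counter(events)
    return sorted(
        name for name, needed in FAULT_THRESHOLDS.items() if counts[name] >= needed
    )


def run_ela(log):
    """Run the three stages in order and return the faults to report."""
    return analyze(compose_events(select(log)))
```

In this sketch the `FAN_NOTICE` entry is dropped at the selection stage, the three `DISK_TIMEOUT` errors are composed into one `DISK_DEGRADED` event, and the single `NET_RETRY` error is selected but never composed into an event.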
At each stage, the ELA processes the errors or events in accordance with predetermined decision criteria. The criteria are expressed in terms of parameters, which are preferably held in suitable tables. Unlike diagnostic systems known in the art, such as expert systems and neural networks, the tables can be edited and updated by development and support personnel, based on field experience with the system and on the particular operating conditions and requirements to which a given system is subjected. The tables can also be copied from one computer system to another. Thus, the present invention provides a tool for fault diagnosis that can be made to identify and offer solutions to an essentially unlimited range of errors appearing in the error log, based on decision criteria that are accessible for adjustment and modification by users in a straightforward manner.
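To illustrate why table-held parameters are easy to adjust and to copy between systems, the sketch below stores a hypothetical analysis table as JSON text and changes one threshold without touching the decision logic. The JSON layout, the field names (`event`, `min_count`, `advice`) and the functions `load_table` and `assess` are assumptions made for this example; the text above does not specify a table format.

```python
import json

# Hypothetical analysis table, kept as plain text so that support personnel
# could edit it or copy it to another system.
ANALYSIS_TABLE_JSON = """
[
  {"event": "DISK_DEGRADED", "min_count": 2, "advice": "Replace the disk"},
  {"event": "LINK_FLAPPING", "min_count": 5, "advice": "Check the network cable"}
]
"""


def load_table(text):
    """Parse the table text into a dict keyed by event name."""
    return {row["event"]: row for row in json.loads(text)}


def assess(event_counts, table):
    """Return (event, advice) pairs for events that reach their threshold."""
    reports = []
    for event, row in table.items():
        if event_counts.get(event, 0) >= row["min_count"]:
            reports.append((event, row["advice"]))
    return reports


table = load_table(ANALYSIS_TABLE_JSON)
counts = {"DISK_DEGRADED": 2, "LINK_FLAPPING": 1}
default_reports = assess(counts, table)

# Suppose field experience shows the link threshold is too high:
# only the table entry changes, the code stays the same.
table["LINK_FLAPPING"]["min_count"] = 1
tuned_reports = assess(counts, table)
```

With the default table only the disk fault is reported; after the threshold is lowered, the link fault is reported as well.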
There is therefore provided, in accordance with a preferred embodiment of the present invention, a method for diagnosing faults in a computer-based system, including:
reading a log of errors of different kinds that have been recorded in the system;
selecting from the log errors of those kinds that are relevant to one or more predetermined types of faults that can occur in the system;
filtering the selected errors so as to compose one or more events, each event including one or more occurrences of one or more of the relevant kinds of the errors; and
analyzing the composed events to reach an assessment that at least one of the predetermined types of faults has occurred.
Preferably, selecting the errors includes providing a respective callback function for each relevant kind of error, wherein the callback function analyzes data in the error log associated with the error in order to determine whether the error should be selected.
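A minimal sketch of per-kind selection callbacks might look like this. The error kinds, the log entry fields (`retries`, `celsius`) and the numeric limits are invented for illustration; the only idea taken from the text is that each relevant kind of error has its own callback, which inspects the logged data and decides whether the error is selected.

```python
def disk_io_callback(entry):
    """Select a disk I/O error only if it was retried at least 3 times."""
    return entry.get("retries", 0) >= 3


def temperature_callback(entry):
    """Select a temperature error only above a hypothetical 80 degree limit."""
    return entry.get("celsius", 0) > 80


# Hypothetical table mapping each relevant kind of error to its callback.
SELECTION_CALLBACKS = {
    "DISK_IO": disk_io_callback,
    "TEMP": temperature_callback,
}


def select_errors(log):
    """Keep entries whose kind has a callback and whose callback accepts them."""
    selected = []
    for entry in log:
        callback = SELECTION_CALLBACKS.get(entry["kind"])
        if callback is not None and callback(entry):
            selected.append(entry)
    return selected


sample_log = [
    {"kind": "DISK_IO", "retries": 5},
    {"kind": "DISK_IO", "retries": 1},
    {"kind": "TEMP", "celsius": 95},
    {"kind": "LOGIN", "user": "guest"},
]
selected = select_errors(sample_log)
```

Kinds with no registered callback (here `LOGIN`) are treated as irrelevant and discarded.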
Further preferably, filtering the selected errors includes filtering the errors according to filtering conditions specified in a filtering table, each filtering condition specifying a set of errors required in order to compose one of the events. Most preferably, selecting the errors includes selecting from the log those errors that are known to belong to the set of errors associated with one or more of the filtering conditions. In a preferred embodiment, the set of errors required in order to compose one of the events includes multiple occurrences of one of the kinds of errors or, additionally or alternatively, one or more occurrences of each of a plurality of the kinds of errors. Preferably, the filtering condition specifies a maximum time lapse during which all of the plurality of the errors must occur in order for the condition to be satisfied. Additionally or alternatively, the filtering table further specifies a level of severity for at least some of the filtering conditions, and filtering the selected errors includes applying the filtering conditions to the errors in the error list in order of the level of severity of the conditions.
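The filtering conditions described above could be represented along the following lines. This is a sketch under stated assumptions: the table layout, the field names (`required`, `max_lapse`, `severity`), the event names and the choice to apply the most severe condition first are all hypothetical, since the text says only that conditions are applied in order of severity.

```python
from collections import Counter

# Hypothetical filtering table. Each condition names the event it composes,
# a severity level, the required number of errors of each kind, and the
# maximum time lapse (in seconds) within which they must all occur.
FILTERING_TABLE = [
    {"event": "DISK_RETRY_STORM", "severity": 1,
     "required": {"DISK_TIMEOUT": 3}, "max_lapse": 60},
    {"event": "CONTROLLER_FAILURE", "severity": 3,
     "required": {"DISK_TIMEOUT": 1, "BUS_RESET": 1}, "max_lapse": 10},
]


def condition_satisfied(condition, errors):
    """True if some window of max_lapse seconds holds all required errors."""
    relevant = sorted((t, k) for t, k in errors if k in condition["required"])
    for i, (start, _) in enumerate(relevant):
        window = Counter(
            k for t, k in relevant[i:] if t - start <= condition["max_lapse"]
        )
        if all(window[k] >= n for k, n in condition["required"].items()):
            return True
    return False


def compose_events(errors):
    """Apply conditions in severity order (assumed: most severe first)."""
    ordered = sorted(FILTERING_TABLE, key=lambda c: c["severity"], reverse=True)
    return [c["event"] for c in ordered if condition_satisfied(c, errors)]


# (timestamp_seconds, error_kind) pairs, assumed already selected.
burst = [(0, "DISK_TIMEOUT"), (4, "BUS_RESET"),
         (30, "DISK_TIMEOUT"), (55, "DISK_TIMEOUT")]
spread_out = [(0, "DISK_TIMEOUT"), (100, "DISK_TIMEOUT"), (200, "DISK_TIMEOUT")]
```

The first condition shows multiple occurrences of a single kind of error, the second shows one occurrence of each of several kinds. The `spread_out` list contains three timeouts, but no 60-second window holds all of them, so no event is composed from it.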
Preferably, filtering the selected errors includes removing errors that have been used in composing one of the events from the error list, whereby any given error is not used to compose more than a single event. Most preferably, removing the errors from the error list includes removing both errors specified as being required to compose a given one of the events and errors specified as being associated with the giv
Chirashnya Igor
Erblich Doron
Gewirtesman Raanan
Darby & Darby
McCarthy Christopher