Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability
Reexamination Certificate
2000-01-11
2003-07-01
Iqbal, Nadeem (Department: 2184)
active
06587960
ABSTRACT:
BACKGROUND OF THE INVENTION
The present invention relates to monitoring, detecting, and isolating failures in a system, and in particular to tools, such as a model of the system, applied for analyzing the system.
Quick and easy determination of failure causes (i.e. failure diagnosis) is a key requirement for providing services like time-to-fix contracts to information technology (IT) departments.
In current solutions, a specially trained engineer is typically requested to go on-site in case a (customer's) system breaks down. There, s/he may then use software tools to search for the root cause of the system crash. This software is typically a collection of test programs testing parts of the system under test (SUT). The engineer selects a couple of tests based on experience from previous cases or may choose to run a complete test suite. This requires a reboot of the SUT, thus reducing system up-time from the customer's perspective. In addition, this approach requires the SUT to be functional to a certain degree, so that a minimal operating system, like DOS, is bootable. Otherwise, the engineer is left to his or her own experience.
This conventional approach has several drawbacks. First, it is a manual process. Simple test suites could be defined, but detailed testing is only done on sub-systems that the engineer suspects may cause the problem. Secondly, the SUT has to be rebooted to run the tests. In cases where the system has regained a productive state, this lowers system uptime. Thirdly, conventional test suites check for a list of potential failure causes. This implies that failure causes unknown to the test suite will never be detected. Expert systems show a way out of these problems.
Expert systems have been used for diagnosing computer failures, as described e.g. by J. A. Kavicky and G. D. Kraft in “An expert system for diagnosing and maintaining the AT&T 3B4000 computer: an architectural description”, ACM, 1989. Analysis of data from on-bus diagnosis hardware is described in Fitzgerald, G. L., “Enhance computer fault isolation with a history memory,” IEEE, 1980. Fault-tolerant computers have for many years been built with redundant processing and memory elements, data pathways, and built-in monitoring capabilities for determining when to switch off a failing unit and switch to a good, redundant unit (cf. e.g. U.S. Pat. No. 5,099,485).
Prior diagnostic systems for determining likely failed components in an SUT include model-based diagnostic systems. A model-based diagnostic system may be defined as a diagnostic system that renders conclusions about the state of the SUT using actual SUT responses from applied tests and an appropriate model of correct or incorrect SUT behavior as inputs to the diagnostic system. Such a diagnostic system is usually based upon computer generated models of the SUT and its components and the diagnostic process.
It is usually desirable to employ a model-based diagnostic system that is based upon a more manageable model of SUT characteristics. Such a model-based diagnostic system usually minimizes the amount of modeling information for an SUT that must be generated by a user before the system can be applied to the SUT. Such modeling usually speeds the process of adapting the diagnostic system to differing SUTs and increases confidence in the determinations rendered by the diagnostic system.
Model-based diagnostic systems are known e.g. from W. Hamscher, L. Console, J. de Kleer, in ‘Readings in system model-based diagnosis’, Morgan Kaufmann, 1992. A test-based system model is used by the Hewlett-Packard HP Fault Detective (HPFD) and described in HP Fault Detective User's Guide, Hewlett-Packard Co., 1996.
U.S. Pat. No. 5,808,919 (Preist et al.) discloses a model-based diagnostics system, based on functional tests, in which the modeling burden is greatly reduced. The model disclosed in Preist et al. employs a list of functional tests, a list of components exercised by each functional test along with the degree to which each component is exercised by each functional test, and the historical or estimated a priori failure rate for individual components. Such model data may be rapidly and easily determined or estimated by test engineers, test programmers or others familiar with, but not necessarily expert on, the device under test. Typically, test engineers may develop the models in a few days to a few weeks depending on the complexity of the device under test.
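By way of illustration only, the three kinds of model data named above (functional tests, the degree to which each test exercises each component, and a priori failure rates) may be held in a simple data structure. The following Python sketch is not taken from Preist et al.; the class name `SystemModel`, its field names, and the example component and test names are hypothetical, and the coverage degree is assumed to be a number between 0 and 1.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SystemModel:
    """Hypothetical container for a test-based system model."""

    # A priori failure probability of each component (assumed 0..1).
    priors: Dict[str, float]
    # coverage[test][component] = assumed degree (0..1) to which the
    # functional test exercises the component.
    coverage: Dict[str, Dict[str, float]] = field(default_factory=dict)

    def components_exercised(self, test: str) -> List[str]:
        """Return the components exercised by the given functional test."""
        return sorted(self.coverage.get(test, {}))

    def tests_exercising(self, component: str) -> List[str]:
        """Return the functional tests that exercise the given component."""
        return sorted(
            test for test, comps in self.coverage.items() if component in comps
        )


# Example model with invented names and numbers.
model = SystemModel(
    priors={"cpu": 0.01, "memory": 0.02, "disk": 0.05},
    coverage={
        "mem_test": {"memory": 0.9, "cpu": 0.3},
        "disk_test": {"disk": 0.8},
    },
)
```

A model of this shape is small enough to be written by someone familiar with the device but not expert in it, which is the reduction in modeling burden described above.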
U.S. Pat. No. 5,922,079 (Booth et al.) discloses an automated analysis and troubleshooting system that identifies potential problems with the test suite (ability of the model to detect and discriminate among potential faults), and also identifies probable modeling errors based on incorrect diagnoses.
EP-A-887733 (Kanevsky et al.) discloses a model-based diagnostic system that provides automated tools that enable a selection of one or more next tests to apply to a device under test from among the tests not yet applied based upon a manageable model of the device under test.
In the above three model-based diagnostic systems, a diagnostic engine combines the system-model-based and probabilistic approaches to diagnostics. It takes the results of a suite of tests and computes, based on the system model of the SUT, the components most likely to have failed.
The diagnostic engine can be used with applications where a failing device is to be debugged using a predetermined set of test and measurement equipment to perform tests from a predesigned set of tests. A test represents a procedure performed on the SUT. A test has a number of possible outcomes.
The tests can be defined to have only two outcomes: pass or fail. For the purpose of this invention, devices or components shall be regarded as either “good” or “bad” and tests shall either “pass” or “fail”. In an example, a test for repairing a computer may involve checking to see if a power supply voltage is between 4.9 and 5.1 volts. If it is, the test passes. If it is not, the test fails. The set of all tests available for debugging a particular SUT shall be called that SUT's test suite.
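The power supply example may be sketched as a two-outcome test. In the following Python sketch, the function name `power_supply_test` is hypothetical, and the limits of 4.9 and 5.1 volts are assumed to be inclusive, which the text does not specify.

```python
def power_supply_test(voltage: float, low: float = 4.9, high: float = 5.1) -> str:
    """Return "pass" if the measured voltage lies within [low, high], else "fail".

    The inclusive limits are an assumption made for this sketch.
    """
    return "pass" if low <= voltage <= high else "fail"


# A test suite is then simply the set of all such tests available for the SUT;
# here the measured value is invented for illustration.
results = {"power_supply_test": power_supply_test(5.3)}
```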
Using test results received from actual tests executed on the SUT and the system model determined for the SUT, the diagnostic engine computes a list of fault candidates for the components of the SUT. Starting, e.g., from a priori failure probabilities of the components, these probabilities are then weighted with the model information according to whether each test passes or fails. At least one test has to fail, otherwise the SUT is assumed to be good.
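One possible way of weighting the a priori probabilities with the test results is sketched below in Python. It is not the algorithm of any of the cited systems. It rests on simplifying assumptions made for this sketch only: exactly one component is bad, a test that exercises the bad component to degree c fails with probability c, and a test that does not exercise the bad component passes. The function name `rank_fault_candidates` and all example values are hypothetical.

```python
from typing import Dict, List, Tuple


def rank_fault_candidates(
    priors: Dict[str, float],
    coverage: Dict[str, Dict[str, float]],
    results: Dict[str, str],
) -> List[Tuple[str, float]]:
    """Return (component, probability) pairs, most likely fault first.

    Assumes a single bad component and coverage degrees in the range 0..1.
    """
    # If no test failed, the SUT is assumed to be good: no fault candidates.
    if "fail" not in results.values():
        return []

    scores: Dict[str, float] = {}
    for component, prior in priors.items():
        score = prior
        for test, outcome in results.items():
            c = coverage.get(test, {}).get(component, 0.0)
            # A failing test supports the candidate in proportion to coverage;
            # a passing test counts against it in proportion to coverage.
            score *= c if outcome == "fail" else (1.0 - c)
        scores[component] = score

    total = sum(scores.values())
    if total == 0.0:
        # No single component explains the observed results under this model.
        return []

    ranked = [(component, s / total) for component, s in scores.items()]
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked


# Invented example: a memory test fails while a disk test passes.
example_ranking = rank_fault_candidates(
    priors={"cpu": 0.01, "memory": 0.02, "disk": 0.05},
    coverage={
        "mem_test": {"memory": 0.9, "cpu": 0.3},
        "disk_test": {"disk": 0.8},
    },
    results={"mem_test": "fail", "disk_test": "pass"},
)
```

In this example the disk has the highest a priori failure probability, yet the failing memory test places the memory first in the ranked list, because the disk is not exercised by the only test that failed.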
In all the known model-based diagnostic systems, the provision of the system model in particular has proved difficult, especially for rather complex systems. One reason is that an individual system model has to be created for each system, and this model generally cannot be reused even for only slightly different systems. Furthermore, the modeling process turns out to be rather costly, on the one hand because the modeling is a manual process and, on the other hand, because this manual process requires highly educated and therefore expensive personnel.
SUMMARY OF THE INVENTION
It is therefore an object of the present invention to facilitate the provision of system models to be applied in model-based diagnostic systems. The object is solved by the independent claims. Preferred embodiments are shown by the dependent claims.
The invention provides an improved tool for determining a system model describing a relation between applicable tests and components of a system under test (SUT). The system model can then be applied in conjunction with actual test results for determining at least one fault candidate representing a specific component of the SUT likely to have caused a fault or failure of the SUT. Each fault candidate is preferably provided with a certain probability that the component has caused the failure. The fault candidates can be preferably represented in a probability-ranked list. Thus, the invention can be applied in diagnosis tools allowing to detect any kind of system failures—hardware, configuration, and software—
Barford Lee Alton
Zurhorst Christian
Agilent Technologies, Inc.
Iqbal Nadeem