Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability
Reexamination Certificate
1993-03-18
2001-01-30
Beausoliel, Jr., Robert W. (Department: 2785)
C717S152000
active
06182243
ABSTRACT:
TECHNICAL FIELD
The present invention relates to program debugging methods and more particularly to a selective method for capturing data in software exception conditions during the operation of a data processing system.
BACKGROUND ART
When a new computer software product is conceived, there is a typical cycle or process that takes place in the course of bringing the product to the market to ensure all the reliability and serviceability required by the customer. The programming cycle typically includes:
the conception of the idea
the design of the software to implement the idea
the coding of the program based on the design
the initial testing and debugging in the development environment
the testing at the user site
the final release of the software to the market
the maintenance
the update of the software with new releases
Normally the release of a software product depends on meeting a development calendar. If defects or errors (known as bugs) appear in the code, the product deadlines will be missed. This is particularly likely if the bugs are complex, subtle or otherwise difficult to find. Such delays can cause a software product to fail in the marketplace. In the same way, the availability, the quality and the ease of maintenance of a software product are key factors for success in a competitive environment.
Historically, most software was designed under the assumption that it would never fail. Software had little or no error detection capability designed into it. When a software error or failure occurred, it was usually the computer operating system that detected the error, and the computer operator cancelled the execution of the software program because the correct result was not achieved. To facilitate the development, test and maintenance of increasingly important and complex programs, it has been necessary to define debugging methods and tools to detect, isolate, report and recover from all software and hardware malfunctions.
Error handling and problem determination are based on the following principles:
all programs may fail or produce erroneous data
a program detecting an error may be itself in error
all detected errors or failures, permanent or transient, must be reported with all the data needed to understand what is going on. A transient error is a temporary failure which is recovered by the program itself but is nevertheless reported for information purposes.
all errors fall in one of the following categories:
Hardware error or failure
Functional error
Invalid input internal to the program
Invalid input external to the program
Time out
Exception conditions, such as:
divide error
invalid address
loop
invalid operation code
floating point error
. . .
Exception conditions are errors detected by the processor itself in the course of executing instructions. They can be classified as Faults, Traps, or Aborts, depending on the usage of the different suppliers of data processors.
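The error categories listed above can be illustrated with a minimal sketch in Python. This is only an analogy at the language level, not the processor-level mechanism the text describes: the names `ErrorCategory` and `classify`, and the mapping from Python exception types to categories, are assumptions made for illustration.

```python
from enum import Enum


class ErrorCategory(Enum):
    """Error categories taken from the list above."""
    HARDWARE = "hardware error or failure"
    FUNCTIONAL = "functional error"
    INVALID_INTERNAL_INPUT = "invalid input internal to the program"
    INVALID_EXTERNAL_INPUT = "invalid input external to the program"
    TIME_OUT = "time out"
    EXCEPTION_CONDITION = "exception condition"


def classify(exc):
    """Map a Python exception to a category (illustrative mapping only)."""
    if isinstance(exc, TimeoutError):
        return ErrorCategory.TIME_OUT
    # ArithmeticError covers ZeroDivisionError, OverflowError and
    # FloatingPointError, the closest analogues of "divide error"
    # and "floating point error".
    if isinstance(exc, ArithmeticError):
        return ErrorCategory.EXCEPTION_CONDITION
    # Assumption: failed assertions stand for bad internal inputs,
    # ValueError for bad external inputs.
    if isinstance(exc, AssertionError):
        return ErrorCategory.INVALID_INTERNAL_INPUT
    if isinstance(exc, ValueError):
        return ErrorCategory.INVALID_EXTERNAL_INPUT
    return ErrorCategory.FUNCTIONAL


try:
    1 / 0
except ZeroDivisionError as exc:
    category = classify(exc)
```

Here the divide error is raised by the runtime rather than detected by the program's own checks, which is the distinguishing feature of an exception condition.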
Upon a software error, the most commonly used method is to capture the entire storage area allocated to the program: this process is called a Global or Physical Dump. However, this method has several drawbacks:
the error occurs before the program can detect it, and the key data required to determine the cause of the error or failure are often lost or overlaid
the more complex the error is, the more data are generated
the dispersion of the information in the system storage increases the difficulty of isolating complex errors
the transfer of a large quantity of data consumes time and storage and can affect the customer's performance.
As frequently happens, so much output is generated that any significant information is buried in a mass of unimportant details. Thus the programmer must always guess whether the benefits of preserving and retrieving all the data in the processor storage outweigh the disadvantages of an endless and laborious analysis. Moreover, it is not always easy to follow the path of execution to the point where the error finally appears, and most program developers use a process called Trace to isolate a software error. According to this process, Trace points are placed within the failing program in order to sample data along the path of execution, the problem is recreated, and data from the Trace points are collected. Unfortunately, Traces have some bad side effects, including the following:
Traces require a branch to a trace routine every time a trace point is encountered, often resulting in a significant impact not only on the problem program's performance but also on other programs executing on the same system
Traces require large data sets to contain the volumes of data generated by Trace points
the programmer who uses Traces to capture diagnostic data invariably ends up sifting through large amounts of data, the majority of which does not reflect on the cause of the error
the problem must be reproduced. If the software problem was caused by a timing problem between two programs (e.g., two networking programs communicating with each other), the trace can slow the program execution down to the point where most timing problems cannot be recreated.
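The volume problem described above can be sketched in a few lines of Python. This is a toy illustration, not taken from the patent: `trace_point`, `trace_log` and `sum_ratios` are hypothetical names, and the failing input is planted deliberately.

```python
trace_log = []


def trace_point(label, **data):
    """Record a sample every time the trace point is reached."""
    trace_log.append((label, data))


def sum_ratios(pairs):
    total = 0.0
    for index, (num, den) in enumerate(pairs):
        # The trace routine is entered on every iteration,
        # whether or not anything is wrong.
        trace_point("before_divide", index=index, num=num, den=den)
        total += num / den
    return total


# 999 healthy inputs followed by one that triggers a divide error.
pairs = [(i, 1) for i in range(1, 1000)] + [(5, 0)]
try:
    sum_ratios(pairs)
except ZeroDivisionError:
    pass

relevant = [rec for rec in trace_log if rec[1]["den"] == 0]
print(len(trace_log), len(relevant))  # prints "1000 1"
```

Only one of the thousand collected records bears on the failure, and every iteration paid the cost of the extra call.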
Solicited Dumps and Traces, as described previously, are triggered at the request of an external intervening party (console, host, operator, and so on). They are based on a methodology which waits for the damage caused by a software error to surface. In both cases large amounts of data are collected, in the hope of catching the data that will determine what went wrong.
Immediate error detection and automatic diagnostic data collection can be achieved by means of error detection code placed directly within the program during development. When an error or failure occurs, it is detected by the program itself, which calls a process to capture and report only the data required to debug the error: this process is usually called Error Notification. The description of the data, such as layout and format, is generally stored in a predefined table whose entries are selected by the error detection code of the program. Typical of this state of the art is U.S. Pat. No. 5,119,377, which discloses a method and system for detecting and diagnosing errors in a computer program. The major advantages of this process are the following:
The reported information can be processed, visualized and interpreted in real time mode.
The data required to diagnose the error are captured the first time the error appears: the problem does not have to be recreated.
An error can be isolated and its propagation stopped before permanent damage can occur.
The data reported are limited to the error to be resolved, which facilitates data reporting as well as problem isolation and correction.
This process is called only when the error is detected and remains completely idle until such a condition occurs. The impact on the computer resources and on the program's performance remains minimal.
Selective Dumps, limited to the error context, can be automatically triggered and retrieved at the request of the program itself (Unsolicited Dump).
Permanent Traces can be included in the captured and reported data. These Traces, also called internal Traces, are an integral part of the code. They are cyclically updated according to the program progress and thereby allow a dynamic view of the suspected code.
The process can be extended to events to report data at some specific stages of the code progress or at particular occurrences.
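The Error Notification process and the permanent internal Trace described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of the cited patent: the error codes, the `CAPTURE_TABLE` layout, and the functions `notify_error` and `process_frame` are all hypothetical.

```python
from collections import deque

# Hypothetical predefined table: each error code selects the fields
# that are worth capturing for that error.
CAPTURE_TABLE = {
    "E_BAD_LENGTH": ("frame_id", "length", "max_length"),
    "E_TIME_OUT": ("frame_id", "elapsed_ms"),
}

# Permanent internal trace: a fixed-size buffer, cyclically overwritten.
internal_trace = deque(maxlen=4)
reports = []


def notify_error(code, context):
    """Capture only the fields the table lists, plus the internal trace."""
    fields = CAPTURE_TABLE[code]
    reports.append({
        "code": code,
        "data": {name: context[name] for name in fields},
        "trace": list(internal_trace),
    })


def process_frame(frame_id, payload, max_length=8):
    internal_trace.append(("process_frame", frame_id))
    length = len(payload)
    if length > max_length:  # error detection code inside the program
        notify_error("E_BAD_LENGTH", {
            "frame_id": frame_id,
            "length": length,
            "max_length": max_length,
            "payload": payload,  # available, but not selected by the table
        })
        return False
    return True


for frame_id, payload in enumerate([b"ok"] * 6 + [b"x" * 20]):
    process_frame(frame_id, payload)
```

The notification routine stays idle for the six valid frames and produces a single report for the seventh, containing three fields and the last four trace entries instead of the whole program state.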
The Error Notification process, previously described, implies that all pieces of code can detect and describe the errors in which they are involved, together with the actions to be taken to recover control or minimize the impact of the failing element. That means a systematic checking of all inputs (internal and external), the use of hardware checkers and the implementation of functional tests and timers at the key points of the code.
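The systematic checking of inputs and the use of timers mentioned above might look like the following Python sketch. The names `DetectedError` and `receive`, the error codes, and the polling approach are assumptions chosen for illustration only.

```python
import time


class DetectedError(Exception):
    """An error detected by the program itself, with its descriptive data."""

    def __init__(self, code, **data):
        super().__init__(code)
        self.code = code
        self.data = data


def receive(queue, expected_len, timeout_s=0.05, poll_s=0.005):
    """Take one message from a list used as a queue, checking every input."""
    # Internal input check: the caller inside the program passed a bad value.
    if expected_len <= 0:
        raise DetectedError("E_INTERNAL_INPUT", expected_len=expected_len)

    # Timer: give up if nothing arrives before the deadline.
    deadline = time.monotonic() + timeout_s
    while not queue:
        if time.monotonic() >= deadline:
            raise DetectedError("E_TIME_OUT", timeout_s=timeout_s)
        time.sleep(poll_s)

    message = queue.pop(0)
    # External input check: the data received from outside is malformed.
    if len(message) != expected_len:
        raise DetectedError("E_EXTERNAL_INPUT",
                            expected_len=expected_len,
                            actual_len=len(message))
    return message
```

Each failure path raises an error that already carries its own code and the few values needed to understand it, so a handler can report it or attempt recovery without a full dump.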
At this stage of the analysis, it is appropriate to classify errors into two different types:
Minor Errors: the program itself detects the error or failure, and the associated pertinent information is collected by means of dedicated error code.
Major Errors: the program loses cont
Berthe Jean
Perrinot Herve
Beausoliel, Jr. Robert W.
Elisca Pierre Eddy
International Business Machines - Corporation
Timar John J.