Method and apparatus for building an operating environment...

Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability

Reexamination Certificate


Details

C714S010000, C714S013000, C714S038110

active

06769073

ABSTRACT:

BACKGROUND
1. Field of Invention
The present disclosure relates generally to the field of failure-avoiding computer systems. It relates more specifically to the sub-fields of failure analysis, fault identification, and failure avoidance.
2. Cross Reference to Issued Patents
The disclosure of the following U.S. patent is incorporated herein by reference:
(A) U.S. Pat. No. 5,522,036 issued May 28, 1996 to Benjamin V. Shapiro, and entitled, METHOD AND APPARATUS FOR THE AUTOMATIC ANALYSIS OF COMPUTER SOFTWARE.
3. Description of Related Art
In the art of computer systems, a failure event is one where a computer system produces a wrong result. By way of example, a computer system may be processing the personal records of a person who was born in the year 1980 and may be trying to determine what the age of that person will be in the year 2010. Such an age determination might be necessary in the example because the computer system is trying to amortize insurance premiums for the person. Because the computer system of our example is infected with a so-called ‘Y2K’ bug, the computer incorrectly determines that the age of the person in the year 2010 will be negative seventy instead of correctly determining that the person's age will be positive thirty.
The failure event in this example is the production of the −70 result for the person's age. The underlying cause of the failure result is known as a fault event. The fault event in our example might be a section of computer software that only considers the last two digits of a decimal representation of the year rather than considering more such digits.
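The two-digit-year fault described above can be sketched in a few lines. This is an illustrative sketch only; the function names are hypothetical and not drawn from the patent:

```python
def age_in_year_faulty(birth_year, target_year):
    # The fault: only the last two digits of each year are considered.
    return (target_year % 100) - (birth_year % 100)

def age_in_year_correct(birth_year, target_year):
    return target_year - birth_year

print(age_in_year_faulty(1980, 2010))   # -70: the failure event
print(age_in_year_correct(1980, 2010))  # 30: the correct result
```

Note that the fault (the `% 100` truncation) sits silently in the code for any pair of years within the same century; the failure only manifests when the inputs straddle a century boundary.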
The above is just an example. Many computer operations may be characterized as a fault event that eventually causes a failure event. Faults can be hardware-based or software-based. An example of a faulty piece of hardware is a Boolean logic circuit that produces an output signal carrying a small noise spike. Generally, this noise spike does not affect the output signals of the computer system. However, if conditions are just right (e.g., other noise sources add to this small spike), the spike may cause a wrong output state to occur in one of the computer's signals. The production of such a wrong output state is a failure. The above-described Y2K problem is an example of a software-based fault and its consequential failure.
It is desirable to build computer systems that consistently output correct results. This generally means that each of the operational hardware modules and executing software modules needs to be free of faults.
In general, producing fault-free software is more difficult than producing fault-free hardware. No technique exists for proving that a given piece of computer software is totally fault-free. Software can be said to be fault-free only to the extent that it has been tested by a testing process that is itself fault-free. In real-life applications, exhaustive testing is not feasible. Even a single numerical input to a program may require testing a range of possibilities from minus infinity to plus infinity. Two such inputs create a two-dimensional input testing space of infinite range, three variables call for a three-dimensional input space, and so on. An attempt to exhaustively run all input combinations would take so long that the utility of, and need for, the application program would likely be gone before testing completed.
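The combinatorial explosion can be made concrete with back-of-envelope arithmetic. The tests-per-second rate below is an assumption chosen purely for illustration:

```python
# Assumed rate for illustration: one billion test cases per second.
TESTS_PER_SECOND = 10**9
SECONDS_PER_YEAR = 3600 * 24 * 365

one_32bit_input = 2**32    # all values of a single 32-bit integer input
two_32bit_inputs = 2**64   # the two-dimensional case: every pair of values

# A single bounded input is already slow to enumerate (~4.3 seconds)...
print(one_32bit_input / TESTS_PER_SECOND)
# ...and merely adding a second input pushes the time to ~585 years.
print(two_32bit_inputs / (TESTS_PER_SECOND * SECONDS_PER_YEAR))
```

And this is for inputs bounded to 32 bits; the unbounded numerical inputs described in the text make exhaustive enumeration impossible in principle, not merely impractical.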
In the mechanical arts, it is possible to make a mechanical system more reliable or robust by designing various components with more strength and/or material than is deemed necessary for the predicted, statistically-normal environment. For example, a mechanical bridge may be made stronger than necessary for its normal operation by designing it with more and/or thicker metal cables and more concrete. The added materials might help the bridge to sustain extraordinary circumstances such as unusually strong hurricanes, unusually powerful earthquakes, etc.
If there is a hidden fault within a mechanical structure, say, for example, that internal chemical corrosion creates an over-stressed point within one cable of a cable-supported bridge, the corresponding failure (e.g., a snapped cable) will usually occur in close spatial and/or temporal proximity to the fault. The cause of the mechanical failure, namely the chemical corrosion inside the one cable, will generally be readily identifiable. Once the fault mechanism is identified, the replacement cable and/or the next bridge design can be structured to avoid the fault and thereby provide a more reliable mechanical bridge.
Computer software failures are generally different from mechanical system failures in that the software failures do not obey the same simplified rules of proximity between the cause (the underlying fault) and effect (the failure). The erroneous output of a computer software process (the failure) does not necessarily have to appear close in either time or physical proximity to the underlying cause (fault).
A number of so-called fault-tolerant techniques exist in the conventional art. The first of these applies only to hardware-based faults and may be referred to as ‘checkpoint re-processing’. Under this technique, a single piece of hardware moves forward from one operational state to the next. Every so often, at a checkpoint, the current state of the hardware is stored into a snapshot-retaining memory; in other words, a retrievable snapshot of the complete machine state is made. The machine then continues to operate. If a hardware failure is later encountered, the machine is returned to the state of its most recent checkpoint snapshot and allowed to continue running from that point forward. If the failure was due to random noise or an intermittent circuit fault, the fault will generally not be present the second time around, and the computer hardware should be able to continue processing without encountering the same failure again. Of course, if the fault is within the software rather than the hardware, re-running the same software will not avoid the fault; it will merely repeat the same fault and will typically manifest its consequential failure.
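A minimal sketch of checkpoint re-processing follows, assuming a transient (hardware-style) fault surfaces as a `RuntimeError`; all names here are illustrative:

```python
import copy

def run_with_checkpoints(initial_state, steps, step_fn, every=10):
    """Advance state one step at a time, snapshotting every `every`
    steps; on a failure, roll back to the most recent snapshot."""
    state = copy.deepcopy(initial_state)
    snapshot, snap_i = copy.deepcopy(state), 0
    i = 0
    while i < steps:
        try:
            state = step_fn(state, i)
        except RuntimeError:
            # Restore the last snapshot and re-run from that checkpoint.
            # A deterministic software fault would recur here forever,
            # which is why the text limits this technique to hardware faults.
            state, i = copy.deepcopy(snapshot), snap_i
            continue
        i += 1
        if i % every == 0:
            snapshot, snap_i = copy.deepcopy(state), i
    return state
```

In a real machine the "snapshot" is the complete hardware state rather than a Python object, but the rollback-and-retry control flow is the same.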
A second of the so-called fault-tolerant techniques may be referred to as ‘majority voting’. Here, an odd number of hardware circuits and/or software processes each processes the same input in parallel and produces a respective result. In the case of the software processes, it may be that different groups of programmers worked independently to encode solutions for a given task. Thus, each of the software programming groups may have come up with a completely different software algorithm for reaching what should be the same result if done correctly.
When the different hardware and/or software processes complete their operations, their results are compared. If the results differ, a vote is taken and either the majority or the greatest plurality with the same result is used as the valid result. This, however, does not guarantee that the correct result is picked. The majority or winning plurality could be wrong despite its numerical supremacy, and the voting process itself may be the underlying cause of a later-manifested failure. This shows that adding more software (e.g., coding and executing different versions of a program) does not necessarily lead to more reliable, fault-free operation.
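Majority voting over independently written versions can be sketched as follows. The three "versions" below are hypothetical stand-ins for independently coded implementations; the third deliberately carries a fault:

```python
from collections import Counter

def majority_vote(results):
    """Return the result produced by the greatest plurality of versions.
    As noted above, this does not guarantee the chosen result is correct:
    if a majority of versions share a fault, the wrong answer wins."""
    value, _ = Counter(results).most_common(1)[0]
    return value

# Three independently coded "versions" of the same squaring computation:
versions = [lambda x: x * x, lambda x: x ** 2, lambda x: x * x + 1]
print(majority_vote([v(7) for v in versions]))  # 49: the faulty version is outvoted
```

Note that `Counter` requires the results to be hashable; comparing richer outputs (floats with rounding differences, structured records) requires a domain-specific equality test, which is itself more software that can harbor faults.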
Software systems are often asked to operate in an input space that has not been previously encountered. A crude analogy is that of an automated spaceship moving forward through space toward uncharted regions. The spaceship encounters a new situation that was not previously anticipated and tested for. The question is then raised: are we going to return the spaceship to Earth to reprogram it? And if so, what are we going to reprogram it to deal with? We have not yet allowed it to operate into the unknown future, and thus we have not yet experienced that future.
