Systems and methods for transient error recovery in reduced...

Electrical computers and digital processing systems: processing – Processing control – Context preserving (e.g., context swapping, checkpointing,...

Reexamination Certificate

Details

C712S244000, C714S017000, C714S021000

Reexamination Certificate

active

06247118

ABSTRACT:

FIELD OF THE INVENTION
The present invention generally relates to data processing, and more particularly, to fault recovery in a data processing system.
BACKGROUND OF THE INVENTION
A popular design for central processing units is reduced instruction set computer (RISC) processors using a pipeline architecture. With pipeline architecture, the tasks performed by a processor are broken down into a sequence of functional units referred to as stages or pipeline stages. Each functional unit receives one or more inputs from the previous stage and produces one or more outputs which may then be used by the subsequent stage. Thus, one stage's output is usually the next stage's input. Consequently, all of the stages are able to work in parallel on different, although typically sequential, instructions in order to provide greater throughput.
Typical stages of a RISC processor pipeline include instruction fetch, register fetch, arithmetic execution, and write-back to registers. In order to improve performance, the pipeline receives a continuous stream of instructions fetched from sequential locations in memory using addresses that are typically stored in a program counter or other suitable device. When several instructions are being concurrently processed in the pipeline and each pipeline stage is performing its designated task, one instruction can complete execution approximately every clock cycle. This design offers greater efficiency than other architectures, such as complex instruction set computer (CISC) architectures, which typically require more than one clock cycle to execute an instruction.
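As a rough illustration of the throughput statement above, the following sketch compares cycle counts with and without stage overlap. It assumes an idealized pipeline in which every stage takes exactly one clock cycle and no stalls or hazards occur; this model and the function names are illustrative assumptions, not part of the described system.

```python
def unpipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles needed when each instruction passes through all stages
    before the next instruction starts (no overlap)."""
    return num_instructions * num_stages


def pipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles needed when stages overlap: the first instruction takes
    num_stages cycles to fill the pipeline, and each later instruction
    completes one cycle after its predecessor (idealized, no stalls)."""
    if num_instructions == 0:
        return 0
    return num_stages + num_instructions - 1


def cycles_per_instruction(num_instructions: int, num_stages: int) -> float:
    """Average cycles per instruction for the idealized pipeline."""
    return pipelined_cycles(num_instructions, num_stages) / num_instructions


# Four stages, as in the fetch / register fetch / execute / write-back example.
print(unpipelined_cycles(100, 4))      # 400
print(pipelined_cycles(100, 4))        # 103
print(cycles_per_instruction(100, 4))  # 1.03
```

Under these assumptions the average cost approaches one cycle per instruction as the instruction stream grows, which is the sense in which an instruction completes approximately every clock cycle.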
Because of its many advantages, only a few of which are discussed above, the RISC architecture enjoys a wide variety of applications including those with safety critical implications such as health care, transportation, military, space, and some manufacturing environments.
The increased reliance on RISC processor-based automated data processing systems in safety critical applications raises the need for these systems to be dependable, that is, to perform their expected task(s) correctly with a high degree of confidence. Design for dependability is one of the many drivers that define the specifications of the RISC processor-based system. Fault avoidance, removal, and tolerance are three approaches that improve system dependability. Fault avoidance is usually achieved through the processes and methods used to generate the design of the system, such as adherence to proven design and development processes or the use of formal methods to validate the correctness of a design. Fault removal is usually achieved by extensive system testing: a fault is removed once it is discovered during system test. Fault tolerance is achieved by incorporating features in the design that enable continued correct system operation in spite of the occurrence of a fault.
A fault may be permanent or transient. A permanent fault is one that causes the RISC processor's behavior to permanently deviate from its specifications, and typically requires human intervention to ameliorate its effect. A transient fault, on the other hand, causes the behavior deviation for a limited time period. The processor typically resumes its behavior as specified once the cause of the fault disappears and the effect of the fault is removed from the system.
Transient faults are typically caused by an event in the processor's physical environment. For example, in an industrial application, many transient faults are due to the electrically noisy environment, where equipment switching causes voltage spikes that impact the processor's power supply, thus causing a transient fault in the microelectronics circuitry that makes up the processor. A single event upset (SEU) is yet another cause of transient faults in the microelectronics circuitry of a RISC processor. SEUs are usually caused by a natural or man-made radiation particle that, as it travels through space, changes the state of a processor by altering its memory content, such as a bit in one of its data or control registers. Once the radiation particle has passed through the circuitry, it no longer affects the microelectronics device. In either of these cases, as well as others, the transient fault may cause the processor to exhibit an error in its processing. In safety critical applications, declaring a processor as permanently failed due to a transient fault may not be a suitable course of action for reasons such as the lack of spare processors to continue operation. This is particularly true in space applications, where processors are expected to operate for an extended time period to justify the cost of the mission. Thus, given the existence of transient faults in certain computing environments, it is desirable to be able to detect and recover from transient faults as quickly and as efficiently as possible so that the performance of the processor is not significantly hampered or degraded.
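The effect of a single event upset on a register can be modeled as the inversion of one bit. The sketch below shows only that state change; the 32-bit register width, the function name flip_bit, and the example program counter value are assumptions chosen for illustration, and no detection or recovery mechanism is modeled.

```python
REGISTER_WIDTH = 32  # assumed register width in bits (illustrative)


def flip_bit(register_value: int, bit_index: int) -> int:
    """Return the register value with one bit inverted, modeling the
    state change a single event upset may cause (hypothetical helper)."""
    if not 0 <= bit_index < REGISTER_WIDTH:
        raise ValueError("bit_index outside the register")
    mask = (1 << REGISTER_WIDTH) - 1
    return (register_value ^ (1 << bit_index)) & mask


program_counter = 0x0000_1000           # example control register content
upset = flip_bit(program_counter, 2)    # a particle inverts bit 2
print(hex(upset))                       # 0x1004

# The fault source is gone once the particle has passed, but the wrong
# value stays in the register until it is overwritten or restored.
restored = flip_bit(upset, 2)
print(restored == program_counter)      # True
```

The point of the example is that the cause of the fault is transient while its effect persists in processor state, which is why a recovery step is needed to remove it.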
The impact on the performance of a processor from a transient fault depends upon the overhead associated with recovery. Two factors which largely control the overhead of transient fault recovery are: (1) the time spent to continually gather the data necessary in anticipation of recovering from a transient fault, and (2) the actual recovery time, i.e., the time it takes the processor to remove the effect of the fault from its memory and to be ready to resume correct operation. Following are discussions of several techniques used for transient fault recovery.
A relatively common technique for transient fault recovery is checkpoint retry, in which the current state data of a program is saved in a memory cache at various points in the execution of the program code. These points are referred to as checkpoints. Checkpoints are taken at the software level, where the program is modified to permit the capture of checkpoints and the rollback to a suitable checkpoint during recovery. Typically, only the values of program variables that changed since the last checkpoint are stored at the next checkpoint. When an error is detected, the program state is restored (also referred to as rolled back) to the last checkpoint that preceded the error in the instruction stream. The amount of rollback necessary to reach the nearest checkpoint is called the rollback distance. The rollback distance may be measured by the number of instructions whose effect must be nullified to reach the nearest checkpoint. Execution is resumed from the checkpoint once the program state is restored from the data stored at the checkpoint. A drawback to this technique is the complexity of the code necessary to allow the data to be gathered at each checkpoint. Another drawback is the relatively high overhead on system performance. The performance overhead of checkpoint retry is largely due to the overhead required for storing the data associated with all the instructions between consecutive checkpoints. This same data is also restored during an actual recovery which, likewise, is time consuming. Further, if more frequent checkpoints are used in order to reduce the amount of data which must be stored at every checkpoint and then restored in case of an error, then more of the processor's time is spent performing error checking. In computing applications requiring control based on precise time intervals, the recovery time spent error checking and rolling back to a checkpoint can be difficult to determine a priori.
Finally, in environments where the processor's next task depends upon changes in its physical environment, such as the firing of a jet to correct a spacecraft's attitude or the reaction to a change in the state of a stage in a manufacturing assembly line, recovery times must be bounded to prevent the processor from reacting to a set of environmental conditions that does not truly reflect the processor's physical environment. The analysis and determination of proper recovery time bounds is very difficult.
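The checkpoint retry scheme discussed above can be sketched at the software level as follows. This is a minimal illustration under stated assumptions, not the implementation of any particular system: the class name CheckpointedProgram, its methods, and the choice to count each variable write as one instruction for the rollback distance are all hypothetical.

```python
class CheckpointedProgram:
    """Software-level checkpoint retry over a set of program variables."""

    def __init__(self, initial_vars):
        self.live = dict(initial_vars)   # current program state
        self.saved = dict(initial_vars)  # state as of the last checkpoint
        self.dirty = set()               # variables changed since then
        self.executed = 0                # writes since the last checkpoint

    def write(self, name, value):
        """Execute one state-changing instruction (assumed: one write)."""
        self.live[name] = value
        self.dirty.add(name)
        self.executed += 1

    def checkpoint(self):
        """Store only the variables changed since the last checkpoint.
        Returns how many values were stored (the checkpoint cost)."""
        for name in self.dirty:
            self.saved[name] = self.live[name]
        stored = len(self.dirty)
        self.dirty.clear()
        self.executed = 0
        return stored

    def rollback(self):
        """Restore the state of the last checkpoint after an error.
        Returns the rollback distance in nullified instructions."""
        for name in self.dirty:
            if name in self.saved:
                self.live[name] = self.saved[name]
            else:
                del self.live[name]      # variable created after checkpoint
        distance = self.executed
        self.dirty.clear()
        self.executed = 0
        return distance


program = CheckpointedProgram({"x": 1, "y": 2})
program.write("x", 10)
stored = program.checkpoint()    # only "x" changed, so one value is stored
program.write("y", 20)
program.write("x", 11)           # suppose an error is detected after this
distance = program.rollback()
print(stored, distance, program.live)  # 1 2 {'x': 10, 'y': 2}
```

In this sketch, taking checkpoints more often lowers both the number of values stored per checkpoint and the rollback distance, at the price of invoking checkpoint() more frequently, which reflects the trade-off between data-gathering overhead and recovery time described in the text.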
Another recovery technique referred to as instruction retry is a variation on the checkpoint retry scheme in that the rollback
