Reexamination Certificate
2001-02-22
2004-06-15
Kim, Kenneth S (Department: 2181)
Error detection/correction and fault detection/recovery
Data processing system error or fault handling
Reliability and availability
C712S032000, C712S228000, C714S012000, C714S015000
active
06751749
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to high reliability processing, by hardware redundancy. More particularly, the invention relates to a processing system with pair-wise processors that operate in a high reliability mode to detect computational errors, and operate independently in a high performance mode.
2. Related Art
Various approaches exist for achieving high reliability processing.
FIG. 1 illustrates one prior art processor 100 for high reliability processing. The processor 100 includes two execution units 130 and 135, which are both the same type of arithmetic unit. For example, the two execution units could both be floating point units, or integer units. The processor 100 has architected registers 120 for holding committed execution results. The two execution units 130 and 135 both execute the same instruction stream in parallel. That is, for each instruction an instance of the instruction executes in each respective execution unit 130 and 135. Then, when the two units are ready to commit the result for an instruction to the register file 120, the two versions of the result are compared by compare unit 125. If the compare unit 125 determines that the versions are the same, then the unit 125 updates one or more of the registers 120 with the result. If the versions do not match, then other actions are taken. In one implementation, a counter records whether an error is occurring repeatedly, and if it is, the error is classified as a “hard” failure. In the case of a hard failure, the instruction issue mechanism does not reissue the faulting instruction, but instead executes a “trap” instruction. One such trap leads to a micro code routine for reading out the state of the defective processor and loading it into a spare processor, which restarts execution at the instruction that originally faulted. In an alternative, where no spare processor is available, the trap leads to the operating system migrating the processes on the faulty processor to other processors, which adds to the workload of the other processors.
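The compare-before-commit behavior described for FIG. 1 can be sketched roughly as follows. This is an illustrative model only, not the patented hardware: the class name `DuplexProcessor`, the threshold value, the retry loop, and the reset of the counter on a successful commit are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# Assumed value; the text does not say how many repeats make a "hard" failure.
HARD_FAILURE_THRESHOLD = 3


@dataclass
class DuplexProcessor:
    """Hypothetical model of a processor with two redundant execution units."""

    unit_a: Callable[[int, int], int]
    unit_b: Callable[[int, int], int]
    registers: Dict[str, int] = field(default_factory=dict)
    mismatch_count: int = 0

    def execute(self, dest: str, x: int, y: int) -> str:
        """Run one instruction on both units and commit only if they agree."""
        while True:
            result_a = self.unit_a(x, y)
            result_b = self.unit_b(x, y)
            if result_a == result_b:
                # Both versions match: update the architected register.
                self.registers[dest] = result_a
                self.mismatch_count = 0
                return "committed"
            # Versions differ: count the error and reissue the instruction,
            # unless it has repeated often enough to count as a hard failure.
            self.mismatch_count += 1
            if self.mismatch_count >= HARD_FAILURE_THRESHOLD:
                return "trap"
```

A caller would treat the `"trap"` return value as the point at which state is moved to a spare processor or the workload is migrated, neither of which is modeled here.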
While this arrangement provides a reliability advantage, it is disadvantageous in that the processor design is more complex than a conventional processor and has greater overhead. Moreover, it limits the processor 100 throughput to have two execution units 130 and 135 in the processor 100 both executing the same instruction stream. Another variation of a processor which is designed for exclusively high reliability operation is shown in Richard N. Gustafson, John S. Liptay, and Charles F. Webb, “Data Processor with Enhanced Error Recovery,” U.S. Pat. No. 5,504,859, issued Apr. 2, 1996.
FIG. 2 illustrates another arrangement for high reliability processing. In this voting arrangement, three processors 200 each execute the same program in parallel and versions of a result are compared at checkpoints in the program on a bus 160 external to the processors 200. If the versions do not match, then other actions are taken, such as substituting a different processor 200 for the one that produced the disparate version. This arrangement is advantageous in that complexity of the individual processors 200 is reduced, and an error producing processor can be identified. Also, the throughput of one of the processors 200 may be greater than that of the one processor 100 in FIG. 1, since the individual processor 200 does not devote any of its execution units to redundant processing. However, the arrangement of FIG. 2 is redundant at the level of the processors 200, and uses three whole processors 200 to recover from a single fault. Also, the error checking is limited to results which are asserted externally by the processors.
In the related application, a pair of processors use state-of-the-art state recovery mechanisms that are already available for recovering from exceptions and apply these mechanisms to operate in lockstep synchrony in a high reliability mode. This is highly advantageous because it achieves the high reliability without extensive modification to existing processor design. However, it is somewhat limiting because of the required synchrony. That is, in the high reliability mode the processors in the related application must process a stream of instructions in the same sequence.
From the foregoing, it may be seen that a need exists for improvements in high reliability processing.
SUMMARY
The foregoing need is addressed in the present invention. According to the invention, in a first embodiment, a multiprocessing system includes a first processor, a second processor, and compare logic. The first processor is operable to compute first results responsive to instructions, the second processor is operable to compute second results responsive to the instructions, and the compare logic is operable to check at checkpoints for matching of the results. Each of the processors has a first register for storing one of the processor's results, and the register has a stack of shadow registers. The processor is operable to shift a current one of the processor's results from the first register into the top shadow register, so that an earlier one of the processor's results can be restored from one of the shadow registers to the first register responsive to the compare logic determining that the first and second results mismatch. It is advantageous that the shadow register stack is closely coupled to its corresponding register, which provides for fast restoration of results.
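The register with its stack of shadow registers might be modeled as below. This is a minimal sketch, not the claimed circuit: the class name `ShadowedRegister`, the default depth of four, and the choice to discard the restored entry and all younger entries on a restore are assumptions.

```python
from typing import List


class ShadowedRegister:
    """Hypothetical register backed by a fixed-depth stack of shadow registers."""

    def __init__(self, depth: int = 4, initial: int = 0) -> None:
        self.value = initial
        self.depth = depth  # assumed depth; the text does not give one
        self.shadows: List[int] = []  # index 0 is the top shadow register

    def checkpoint(self) -> None:
        """Shift the current value into the top shadow register."""
        self.shadows.insert(0, self.value)
        del self.shadows[self.depth:]  # the oldest entry falls off the bottom

    def restore(self, levels: int = 1) -> None:
        """Restore the value saved `levels` checkpoints ago."""
        if not 1 <= levels <= len(self.shadows):
            raise ValueError("no shadow register at that depth")
        self.value = self.shadows[levels - 1]
        del self.shadows[:levels]  # drop the restored entry and younger ones
```

In this model a mismatch reported by the compare logic would lead to a call to `restore()`, and a match would simply let execution continue with the saved values aging down the stack.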
In a further aspect of an embodiment, each processor has a signature generator and a signature storage unit. The signature generator and storage unit are operable to cooperatively compute a cumulative signature for a sequence of the processor's results, and the processor is operable to store the cumulative signature in the signature storage unit pending the match or mismatch determination by the compare logic. The checking for matching of the results includes the compare logic comparing the cumulative signatures of each respective processor. It is faster, and therefore advantageous, to check respective cumulative signatures at intervals rather than to check each individual result.
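The idea of comparing one cumulative signature per interval, rather than every individual result, can be sketched as follows. The text does not name a signature function, so the use of CRC-32 here is purely an assumption, as are the names `SignatureUnit` and `signatures_match` and the 8-byte encoding of each result.

```python
import zlib
from typing import Iterable


class SignatureUnit:
    """Hypothetical signature generator plus its storage unit."""

    def __init__(self) -> None:
        self.signature = 0  # stored cumulative signature

    def accumulate(self, result: int) -> None:
        """Fold one result into the running signature."""
        data = result.to_bytes(8, "little", signed=True)
        # zlib.crc32 accepts the previous checksum as a starting value.
        self.signature = zlib.crc32(data, self.signature)


def signatures_match(results_a: Iterable[int], results_b: Iterable[int]) -> bool:
    """Compare two result sequences by their cumulative signatures only."""
    unit_a, unit_b = SignatureUnit(), SignatureUnit()
    for r in results_a:
        unit_a.accumulate(r)
    for r in results_b:
        unit_b.accumulate(r)
    return unit_a.signature == unit_b.signature
```

The compare logic then handles a single fixed-width value per processor at each checkpoint. A checksum of this kind can in principle collide, so a match indicates agreement with high probability rather than with certainty.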
Also, in one embodiment, the instructions have a certain instruction sequence and at least one of the processors may execute instructions in a sequence different than the program sequence, but both of the processors execute store-type instructions according to a sequence in which the store-type instructions occur in the certain instruction sequence. The checkpoints are responsive to store instructions, so that a first sequence of results for the first processor ends at one of the checkpoints with a result for one of the store instructions and a second sequence of results ends at the checkpoint for the second processor with a result for the same one of the store instructions. It is advantageous to trigger checkpoints responsive to store-type instructions so that while an intermediate one of the results of the first sequence of results may be different than a corresponding intermediate one of the results of the second sequence of results, nevertheless the first processor's ending result for the first sequence and the second processor's ending result for the second sequence tend to match unless one of the processors has malfunctioned.
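The following toy model illustrates why store-triggered checkpoints tolerate different execution orders. The three-operation instruction set, the `run` and `first_mismatch` functions, and the single-bit fault injection are all hypothetical and exist only for this example. Two runs execute the non-store instructions in different orders but keep the stores in program order, and only the result that ends each sequence is compared.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Instr = Tuple  # ("li", dest, imm) | ("add", dest, src1, src2) | ("store", src)


def run(program: Sequence[Instr], order: Sequence[int],
        faulty_index: Optional[int] = None) -> List[List[int]]:
    """Execute `program` in `order`; return result sequences split at stores."""
    regs: Dict[str, int] = {}
    segments: List[List[int]] = []
    current: List[int] = []
    for idx in order:
        op = program[idx]
        kind = op[0]
        if kind == "li":
            result = op[2]
        elif kind == "add":
            result = regs[op[2]] + regs[op[3]]
        elif kind == "store":
            result = regs[op[1]]
        else:
            raise ValueError(f"unknown instruction {kind!r}")
        if idx == faulty_index:
            result ^= 1  # inject a single-bit error into this result
        if kind != "store":
            regs[op[1]] = result
        current.append(result)
        if kind == "store":  # a store closes the sequence: checkpoint here
            segments.append(current)
            current = []
    return segments


def first_mismatch(a: List[List[int]], b: List[List[int]]) -> Optional[int]:
    """Index of the first checkpoint whose ending results differ, else None."""
    for i, (seg_a, seg_b) in enumerate(zip(a, b)):
        if seg_a[-1] != seg_b[-1]:
            return i
    return None


PROGRAM = [("li", "a", 2), ("li", "b", 5), ("add", "c", "a", "b"),
           ("store", "c"), ("li", "d", 1), ("add", "e", "c", "d"),
           ("store", "e")]
IN_ORDER = [0, 1, 2, 3, 4, 5, 6]
REORDERED = [1, 0, 2, 3, 4, 5, 6]  # the two loads swapped; stores stay in order
```

Here `run(PROGRAM, IN_ORDER)` and `run(PROGRAM, REORDERED)` produce different intermediate results in the first sequence, yet both sequences end with the same stored value, so no mismatch is reported unless a fault is injected.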
In an alternative embodiment, the second processor executes the instructions in a sequence identical to a sequence in which the first processor executes the instructions, and the checkpoints are responsive to accumulated number of execution cycles. In this embodiment the checkpoints may also be responsive to store instructions. In one such embodiment, the checkpoints are responsive to store instructions and accumulated number of execution cycles if there has been no store instruction since a last checkpoint.
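The combined checkpoint trigger in this alternative embodiment (a store instruction, or an accumulated cycle count when no store has occurred since the last checkpoint) can be sketched as below. The function name `checkpoint_cycles`, the `max_cycles` parameter, and the one-instruction-per-cycle simplification are assumptions for illustration.

```python
from typing import Iterable, List


def checkpoint_cycles(is_store: Iterable[bool], max_cycles: int) -> List[int]:
    """Return the cycle numbers at which a checkpoint would be taken.

    `is_store[i]` says whether the instruction completing in cycle i is a
    store. A checkpoint is taken at every store, and also whenever
    `max_cycles` cycles have accumulated since the last checkpoint.
    """
    checkpoints: List[int] = []
    since_last = 0
    for cycle, store in enumerate(is_store):
        since_last += 1
        if store or since_last >= max_cycles:
            checkpoints.append(cycle)
            since_last = 0
    return checkpoints
```

For example, with stores in cycles 2 and 7 of an eight-cycle trace and `max_cycles=3`, checkpoints fall at cycles 2, 5, and 7: the one at cycle 5 is triggered by the cycle count alone.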
While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and will be described
Hofstee Harm Peter
Nair Ravi
England Anthony V.S.
International Business Machines - Corporation
Kim Kenneth S
Salys Casimer K.