Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability
Reexamination Certificate
1998-05-12
2001-01-09
Iqbal, Nadeem (Department: 2785)
C714S011000
active
06173414
ABSTRACT:
FIELD OF THE INVENTION
The present invention generally relates to digital data processing, and more particularly, to error detection in digital data processors.
BACKGROUND OF THE INVENTION
Modern technology has brought about many advancements in the design and implementation of computer processors. However, errors in the digital signals that represent data or control words remain a problem in all computer systems. An undetected error in either the processing control flow or the data, which may arise from a variety of fault sources, may propagate erroneous data each time a further operation is performed on that data or on any data derived from it. An error in a control word can rapidly propagate corrupted data and can corrupt good data that is processed with the erroneous control word. The many efforts made in recent years to minimize or contain the adverse effects of faults, as they are manifested through resultant errors, have drastically reduced the potentially devastating impact of errors on the integrity of computational results. However, error detection and recovery continue to be major concerns to computer system designers as designs are constantly being driven to higher standards of dependability, throughput, levels of integration, and computational complexity.
A variety of strategies and techniques have been proposed for error detection. Analyses to determine the optimal error detection technique must consider factors such as error detection latency and coverage. Strategies based on information redundancy and techniques for their realization yield designs with low error detection latency. The percentage of detectable errors, i.e., error coverage, is often used to select the desired information redundancy technique. The techniques range from information encoding schemes, i.e., check codes, to a complete copy of the computer system, i.e., a redundant or duplicate system. Error check codes use additional information, i.e., a plurality of bits that form an encoded representation of the original data or control sequence, in order to determine whether the data or control sequence has erroneously changed. Examples of error check codes include parity code for a data word and Cyclic Redundancy Code (CRC) for execution control sequences.
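The two check codes named above can be illustrated with a minimal Python sketch. It is not taken from the patent: the function names, the sample data word, and the sample control sequence are hypothetical, and `zlib.crc32` (a 32-bit CRC from the Python standard library) stands in for whatever CRC a real design would use.

```python
import zlib


def parity_bit(word: int) -> int:
    """Return the even-parity bit of a non-negative integer data word."""
    return bin(word).count("1") % 2


def parity_ok(word: int, stored_parity: int) -> bool:
    """Check a data word against the parity bit stored alongside it."""
    return parity_bit(word) == stored_parity


# Hypothetical 8-bit data word with four 1 bits, so its parity bit is 0.
word = 0b10110010
stored = parity_bit(word)

# A single flipped bit changes the parity and is detected.
single_flip_detected = not parity_ok(word ^ 0b00000100, stored)

# Two flipped bits leave the parity unchanged and go undetected,
# which is why coverage matters when choosing a check code.
double_flip_detected = not parity_ok(word ^ 0b00000110, stored)

# Hypothetical control sequence protected by a CRC.
control_sequence = bytes([0x12, 0x34, 0x56])
stored_crc = zlib.crc32(control_sequence)

# One bit of the middle byte is changed (0x34 -> 0x35).
tampered_sequence = bytes([0x12, 0x35, 0x56])
crc_mismatch_detected = zlib.crc32(tampered_sequence) != stored_crc
```

The contrast between the single-bit and double-bit cases shows the coverage trade-off discussed above: a one-bit parity code is cheap but misses any even number of bit flips, while a longer code such as a CRC covers more error patterns at the cost of more check bits.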
If check codes are utilized, each operation must be performed so that the check code remains valid after the operation. With arithmetic logic, for example, the operation may be carried out in a different number system such as with the residue number scheme, a detailed discussion of which can be found in Avizienis, A. A., “Arithmetic Error Codes: Cost and Effectiveness Studies for Application in Digital Design,” IEEE Trans. Comp., Vol. C-20, No. 11, November 1971, pp. 1322-1331. However, the use of a different number system involves an initial conversion to that number system and a subsequent conversion back from that number system after the operation is completed. Accordingly, this method of error detection may significantly reduce the performance of the data processor.
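The idea of keeping a check code valid across an arithmetic operation can be sketched as follows. This is a simplified residue check that runs beside ordinary binary addition, not the full residue number system conversion described above and not a method taken from the patent. The modulus of 3, the function names, and the `fault_mask` parameter used to simulate a faulty adder are all assumptions made for illustration.

```python
MODULUS = 3  # assumed check modulus; any single-bit error changes a value mod 3


def residue(value: int) -> int:
    """Return the residue check code of a value."""
    return value % MODULUS


def checked_add(a: int, b: int, fault_mask: int = 0):
    """Add two operands and verify the sum against its predicted residue.

    fault_mask is XORed into the sum to simulate a fault in the adder.
    Returns the (possibly corrupted) sum and whether the check passed.
    """
    result = (a + b) ^ fault_mask
    # The residue of a sum equals the sum of the residues, modulo MODULUS,
    # so the expected check code is computed without using the main adder.
    expected = (residue(a) + residue(b)) % MODULUS
    return result, residue(result) == expected


good_sum, good_ok = checked_add(25, 17)
bad_sum, bad_ok = checked_add(25, 17, fault_mask=0b1)
```

Here `checked_add(25, 17)` yields 42 with a passing check, while the simulated fault yields 43, whose residue no longer matches the predicted one. The extra residue computations on every operation hint at the performance cost noted above.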
The use of redundant, or duplicate, circuitry to check for errors has long been recognized as a highly effective error checking technique. The redundant circuitry approach essentially comprises two processors, a primary processor and a redundant processor, which are similarly connected to receive identical addresses, data, control signals and instructions. The primary processor, referred to as the master processor, provides normal processing and control. The redundant processor, referred to as the checker processor, runs in parallel with the master processor. If the system is operating properly, the master and checker processors operate in lock step and the results determined by the two processors should be identical. Otherwise, an error has occurred in the system. This approach has the advantage that the checker processor is identical to the master processor, and therefore, can be used as a spare resource in the event that the master processor fails or becomes faulty. This approach, however, requires twice as much hardware as a single processor, though it has a smaller impact on performance than the check code approach discussed above. The master and checker processors typically run in parallel, and only processor outputs are used for error detection. Thus, the impact on internal processor throughput may be essentially eliminated.
Further, the redundant circuitry (also referred to as master/checker) approach to fault detection requires that the master's data be visible to the checker. However, current trends toward increased integration on a chip and the associated computational complexity have decreased visibility to internal operations. An erroneous operation that changes only an internal state (e.g., registers, caches, etc.) that is not visible to the checker may not be detected for a relatively long time. The error, in such a case, may only become visible when the master's state data is made visible to the checker, or when another internal operation uses the state data in a manner that makes it visible. This can result in exceptionally long error detection latencies. Any effort to reduce the error detection latency by making the output of each of the master's operations visible to the checker is typically not practical because of pin limitations and the adverse impact on performance. A processor's internal bandwidth, i.e., its processing throughput, is typically much greater than its external bandwidth. Input/output operations are relatively slow, and therefore, it is generally considered too costly to make all the master's output data visible for checking.
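The latency problem described above can be made concrete with a small simulation. It is a hypothetical model rather than anything specified in the patent: each processor is reduced to a single internal accumulator, the fault is a single flipped bit injected into the master, and `output_every` is an assumed parameter giving how often the internal state reaches the externally visible outputs.

```python
def detection_latency(operands, output_every, fault_at):
    """Return the number of steps between a fault and its detection.

    The master and checker run in lock step on the same operands. Their
    accumulators are compared only on steps where the state is output,
    that is, every `output_every` steps. Returns None if the fault is
    never observed within the given operands.
    """
    master_acc = 0
    checker_acc = 0
    for step, operand in enumerate(operands):
        master_acc += operand
        checker_acc += operand
        if step == fault_at:
            master_acc ^= 0x10  # simulated fault in the master's internal state
        state_is_visible = (step + 1) % output_every == 0
        if state_is_visible and master_acc != checker_acc:
            return step - fault_at
    return None


operands = [1] * 12
latency_rare_output = detection_latency(operands, output_every=4, fault_at=1)
latency_every_step = detection_latency(operands, output_every=1, fault_at=1)
```

With the state visible only every fourth step, the fault injected at step 1 is caught two steps later; with the state visible on every step it is caught immediately. The second case corresponds to exposing every operation's result, which the paragraph above describes as impractical because of pin and bandwidth limits.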
Therefore, a heretofore unresolved need existed in the industry for an error detection system and method that provides improved detection in a master/checker system with minimal error detection hardware overhead, minimal error detection latency, and minimal adverse impact on performance.
SUMMARY OF THE INVENTION
It is therefore an object of the present invention to provide improved error detection.
It is another object of the present invention to provide improved error detection in a master/checker configuration with minimum hardware and performance overhead.
It is yet another object of the present invention to provide improved error detection in a master/checker configuration with reduced error detection latency.
These and other objects of the present invention are provided by a fault-tolerant digital data processing system that comprises a first microcircuit that performs an internal operation on a data set to generate a first internal state, and a second microcircuit that performs an identical internal operation on an identical data set to generate a second internal state. This is commonly referred to as a master/checker, or duplicate, configuration. The system further comprises a first data encoding mechanism that encodes the first internal state into a first code, a second data encoding mechanism that encodes the second internal state into a second code, and a comparator external to the first and second microcircuits that compares the first code and the second code to determine if an error has occurred. Additionally, by encoding the internal states of the first and second microcircuits so as to reduce the number of bits of data needed to represent the outputs, fewer pin connections are needed to make the outputs externally visible for performing error detection.
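The arrangement summarized above can be sketched in Python as a software model. This is only an illustration under stated assumptions, not the patented implementation: the `Microcircuit` class, its 16-bit accumulator, the single-bit parity encoding, and the `fault_mask` and `fault_at` parameters used to inject a fault are all hypothetical names chosen for this example.

```python
def parity(value: int) -> int:
    """Encode a state value into a single even-parity bit."""
    return bin(value).count("1") % 2


class Microcircuit:
    """Hypothetical model of one microcircuit with hidden internal state."""

    def __init__(self):
        self.accumulator = 0  # internal state, not driven onto output pins

    def operate(self, operand: int, fault_mask: int = 0) -> int:
        """Perform the internal operation and return the encoded state.

        fault_mask is XORed into the new state to simulate a fault.
        """
        self.accumulator = ((self.accumulator + operand) & 0xFFFF) ^ fault_mask
        return parity(self.accumulator)  # one bit leaves the chip, not sixteen


def comparator(first_code: int, second_code: int) -> bool:
    """External comparator: True when the two codes agree."""
    return first_code == second_code


def first_mismatch(operands, fault_at=None):
    """Run master and checker in lock step; return the step of the first
    code mismatch, or None if the codes always agree."""
    master, checker = Microcircuit(), Microcircuit()
    for step, operand in enumerate(operands):
        mask = 0x0010 if step == fault_at else 0
        first_code = master.operate(operand, fault_mask=mask)
        second_code = checker.operate(operand)
        if not comparator(first_code, second_code):
            return step
    return None


no_fault = first_mismatch([3, 5, 7])
with_fault = first_mismatch([3, 5, 7], fault_at=1)
```

In this model the fault injected at step 1 is reported at step 1, because the encoded internal state is compared after every operation while only one bit per microcircuit crosses the chip boundary. A single parity bit catches this single-bit fault; a fault that flips an even number of state bits would need a stronger encoding.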
The data encoding mechanism may be as simple as a single bit parity encoding operation, or any other suitable scheme that meets system requirements for, among other things, the number of signals to be compared, the fault environment, or the desired fault coverage. The comparator may be configured to generate a predetermined signal if the first code and the second code are equal or identical. Further, the first microcircuit and the second microcircuit may perform transforming operations.
In accordance with another aspect of the pres
Abouelnaga Amir A.
Zumkehr John F.
Alston & Bird LLP
Iqbal Nadeem
McDonnell Douglas Corporation