Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability
Reexamination Certificate
1998-08-12
2002-03-12
Iqbal, Nadeem (Department: 2184)
active
06357024
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
This invention relates to electronic computer systems, and more particularly to fault-tolerant or reliable electronic systems employing multiple processing units in order to reduce computational errors and/or determine the source of computational errors. The invention described herein may also be useful in supporting the development or investigation of improvements to components used in electronic systems employing multiple processing units.
2. Description of the Relevant Art
An electronic circuit such as a microprocessor may fail to produce a correct result due to “hard” failures or “soft” errors. Hard failures are permanent and reproducible, and typically result from design errors, fabrication errors, fabrication defects, and/or physical failures. A failure to properly implement a functional specification represents a design error. Fabrication errors are attributable to human error, and include the use of incorrect components, the incorrect installation of components, and incorrect wiring. Examples of fabrication defects, which result from imperfect manufacturing processes, include conductor opens and shorts, mask alignment errors, and improper doping profiles. Physical failures occur due to wear-out and/or environmental factors. The thinning and/or breakage of fine aluminum lead wires inside integrated circuit packages due to electromigration or corrosion are examples of physical failures. Soft errors, on the other hand, are temporary and non-reproducible. Soft errors are often the result of transient phenomena such as electrical noise (e.g., power supply “glitches” and ground “bounce”), energetic particles (e.g., alpha particles), or “marginal” circuit design.
Incorrect results cannot be tolerated in computer systems used in, for example, aircraft flight control systems, missile guidance systems, and banking transactions. Computer systems used in such critical applications must be highly reliable. One method used to increase the reliability of such computer systems is called functional redundancy checking (FRC). FRC typically employs two electronic microprocessor devices functioning as central processing units (CPUs). A first “master” microprocessor and a second “checker” microprocessor receive the same input signals and execute instructions simultaneously (i.e., in lock step). The checker microprocessor compares the output signals produced by the master microprocessor to its own internally-generated output signals. If any output signal produced by the master microprocessor does not match the respective output signal produced by the checker microprocessor, the checker microprocessor generates an error signal which initiates corrective action (i.e., “notification”).
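The lock-step comparison described above can be illustrated with a minimal software sketch. The names `frc_compare`, `master_outputs`, and `checker_outputs` are hypothetical and do not appear in the source. In an actual FRC system the comparison is performed by hardware at the output terminals, so this is only a model of the idea under those stated assumptions.

```python
from typing import Optional, Sequence


def frc_compare(master_outputs: Sequence[int],
                checker_outputs: Sequence[int]) -> Optional[int]:
    """Return the index of the first cycle whose outputs differ, else None.

    Each element models the value present on the output terminals during one
    clock cycle (an assumption made purely for illustration).
    """
    for cycle, (m, c) in enumerate(zip(master_outputs, checker_outputs)):
        if m != c:
            # In hardware, the checker would generate its error signal here.
            return cycle
    return None


# Both CPUs receive the same inputs; a hypothetical fault flips one bit of
# the master's output in cycle 2.
checker = [0x10, 0x22, 0x3C, 0x41]
master = [0x10, 0x22, 0x3C ^ 0x04, 0x41]
first_mismatch = frc_compare(master, checker)
```

In this sketch a mismatch is reported only for values that actually reach the compared outputs, which mirrors the fact that FRC observes output signals rather than internal state.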
FIG. 1 is a block diagram of a typical electronic computer system 10 employing FRC. Electronic computer system 10 includes identical first and second CPUs 12a and 12b, a processor bus 14, chip set logic 16, a memory unit 18, a memory bus 20, a system bus 22, and a peripheral device 24. CPUs 12a and 12b are typically microprocessor integrated circuits formed upon a single monolithic semiconductor substrate. Processor bus 14 couples both CPU 12a and CPU 12b to each other and to chip set logic 16. Chip set logic 16 functions as an interface between CPUs 12a-b and system bus 22, and between CPUs 12a-b and memory unit 18. System bus 22 is adapted for coupling to one or more peripheral devices. Peripheral device 24 is coupled to system bus 22. Peripheral device 24 may be, for example, a disk drive unit, a video display unit, or a printer. Memory unit 18 stores data, and typically includes semiconductor memory devices. Chip set logic 16 is coupled to memory unit 18 via memory bus 20, and may include a memory controller.
CPUs 12a and 12b include built-in functional redundancy checking circuitry. During system initialization, either CPU 12a or CPU 12b is configured to be the master, and the other CPU is configured to be the checker CPU. The master CPU drives its output terminals, while the checker CPU changes its output terminals to function as input terminals. The respective terminals (e.g., “pins”) of CPUs 12a and 12b are coupled together. The checker CPU compares its internally-generated values to those produced by the master CPU and received at the respective terminals. If any output signal produced by the master CPU does not match the respective output signal produced by the checker CPU, the checker CPU produces an error signal. The error signal may serve as notification to external error recovery hardware (not shown). For example, the error signal may be routed to a third maintenance CPU (not shown) or an interrupt controller (not shown) which initiates an error recovery routine in response to the error signal. The error recovery routine may involve “backing up” the software program running at the time the error occurred to an established “checkpoint” at which instruction execution may be reinitiated.
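The checkpoint-based recovery mentioned above can be sketched as a toy model. The class `CheckpointedProgram` and its methods are hypothetical names introduced for illustration. The source does not specify how checkpoints are stored or restored, so the deep-copied dictionary used here is an assumption.

```python
import copy


class CheckpointedProgram:
    """Toy model of a program whose state can be backed up to a checkpoint."""

    def __init__(self) -> None:
        # Hypothetical program state: a program counter and an accumulator.
        self.state = {"pc": 0, "acc": 0}
        self._checkpoint = copy.deepcopy(self.state)

    def step(self, value: int) -> None:
        """Execute one instruction: add a value and advance the counter."""
        self.state["acc"] += value
        self.state["pc"] += 1

    def checkpoint(self) -> None:
        """Record the current state as the established checkpoint."""
        self._checkpoint = copy.deepcopy(self.state)

    def on_error_signal(self) -> None:
        """Recovery routine: restore the state saved at the last checkpoint."""
        self.state = copy.deepcopy(self._checkpoint)


prog = CheckpointedProgram()
prog.step(5)
prog.checkpoint()        # checkpoint holds pc=1, acc=5
prog.step(7)             # suppose the checker flagged this step as faulty
prog.on_error_signal()   # back up to the checkpoint
prog.step(7)             # reinitiate execution from the checkpoint
```

Work performed between the checkpoint and the error is discarded and re-executed, which is one reason the delay between an error and its detection matters.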
The master CPU initiates data read and write operations. In response to a memory read request from the master CPU, chip set logic 16 obtains data from memory unit 18 via memory bus 20 and provides the data to both CPU 12a and CPU 12b via processor bus 14. During a memory write operation, chip set logic 16 receives the data from the master CPU and stores the data within memory unit 18 via memory bus 20. In response to a read request from an address within an address range assigned to peripheral device 24, chip set logic 16 obtains data from peripheral device 24 via system bus 22 and provides the data to both CPU 12a and CPU 12b via processor bus 14. During a write operation to an address within an address range assigned to peripheral device 24, chip set logic 16 receives the data from the master CPU and provides the data to peripheral device 24 via system bus 22.
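The address-based routing performed by the chip set logic can be modeled with a short sketch. The class `ChipSetLogic`, the constants `PERIPHERAL_BASE` and `PERIPHERAL_LIMIT`, and the dictionary-backed storage are all assumptions made for illustration. The source does not give an address map or any implementation detail.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Hypothetical address range assigned to the peripheral device.
PERIPHERAL_BASE = 0xF000
PERIPHERAL_LIMIT = 0xFFFF


@dataclass
class ChipSetLogic:
    """Toy model: routes each access to memory or to the peripheral."""

    memory: Dict[int, int] = field(default_factory=dict)
    peripheral: Dict[int, int] = field(default_factory=dict)

    def _target(self, address: int) -> Dict[int, int]:
        # Addresses inside the peripheral range go to the peripheral device;
        # all other addresses go to the memory unit.
        if PERIPHERAL_BASE <= address <= PERIPHERAL_LIMIT:
            return self.peripheral
        return self.memory

    def read(self, address: int) -> Tuple[int, int]:
        """Return the same data twice: one copy per CPU (master, checker)."""
        value = self._target(address).get(address, 0)
        return value, value

    def write(self, address: int, value: int) -> None:
        """Store data supplied by the master CPU only."""
        self._target(address)[address] = value


chipset = ChipSetLogic()
chipset.write(0x0100, 42)   # lands in the memory unit
chipset.write(0xF004, 7)    # lands in the peripheral device
to_cpu_a, to_cpu_b = chipset.read(0x0100)
```

Returning the read data as a pair reflects that both CPUs must receive identical inputs to remain in lock step, while writes originate from the master alone.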
Several problems occur when implementing electronic computer system 10. Most importantly, the signals driven upon the output terminals of a CPU often do not adequately reflect the current internal execution state of the CPU. For example, there may be a time delay of many system clock cycles before an activity within the CPU results in signals being driven upon the output terminals. In addition, CPUs 12a and 12b may include relatively large internal cache memory systems 26a and 26b. Such cache memory systems are capable of holding large numbers of instructions and data. CPUs 12a and 12b are capable of operating for extended periods using instructions and data stored in respective cache memory systems 26a and 26b. During these extended periods, any computational errors produced do not propagate to the terminals of CPUs 12a and 12b, and are hence not “visible” for detection using FRC. As a result, cache memory systems 26a and 26b tend to delay error detection. Early detection of an error is key to determining the cause of the error and reducing the likelihood that valuable data is lost due to the error.
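The delay in error detection caused by internal caching can be illustrated with a simplified model. The functions `run` and `detection_cycle`, the accumulator workload, and the fixed `writeback_period` are hypothetical. They assume that internal results reach the output terminals only at periodic write-backs, which is a simplification of real cache behavior.

```python
from typing import List, Optional


def run(inputs: List[int], fault_cycle: Optional[int] = None) -> List[int]:
    """Return the accumulator value after each cycle.

    If fault_cycle is given, a soft error flips the low bit of the
    accumulator in that cycle (an illustrative fault model).
    """
    acc = 0
    trace = []
    for cycle, x in enumerate(inputs):
        acc += x
        if cycle == fault_cycle:
            acc ^= 1
        trace.append(acc)
    return trace


def detection_cycle(inputs: List[int], fault_cycle: int,
                    writeback_period: int) -> Optional[int]:
    """Return the first cycle at which the checker can see a mismatch."""
    master = run(inputs, fault_cycle)
    checker = run(inputs)
    for cycle in range(len(inputs)):
        # The internal value is driven onto the terminals only at the end
        # of each write-back period.
        visible = (cycle + 1) % writeback_period == 0
        if visible and master[cycle] != checker[cycle]:
            return cycle
    return None


workload = [1] * 16
late = detection_cycle(workload, fault_cycle=2, writeback_period=8)
early = detection_cycle(workload, fault_cycle=2, writeback_period=1)
```

In this model the same fault is caught in the cycle it occurs when every result is visible, but only several cycles later when results stay internal until a write-back.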
Furthermore, the maximum amount of data which may be transferred over processor bus 14 in a given amount of time (i.e., the maximum “speed” of processor bus 14) is limited by the increased electrical loading of two CPUs and signal reflections within the signal lines of processor bus 14 due to the multiple connection points (i.e., terminations). Electronic computer system 10 does not support separate “point-to-point” processor buses capable of much higher speeds.
It would be beneficial to have an electronic system and method implementing FRC by comparing “signatures” generated by each CPU. Each “signature” would include a relatively small number of bits, and would preferably be representative of the internal execution state of the CPU. Immediate comparisons of representative signatures would facilitate earlier error detection, especially when the CPUs include relatively large internal cache memory systems. In addition, comparing only such sig
Dutton Drew J.
Mudgett Dan S.
White Scott A.
Conley & Rose & Tayon P.C.
Daffer Kevin L.
Iqbal Nadeem