System and method for detecting errors using CPU signature

Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability

Reexamination Certificate

active

06173416

ABSTRACT:

BACKGROUND OF THE INVENTION
This invention relates to a fault-tolerant computer system.
The traditional approaches to system reliability attempt to prevent the occurrence of faults through improved design methodologies, strict quality control, and various other measures designed to shield system components from external environmental effects (e.g., hardening, radiation shielding). Fault tolerance methodologies assume that system faults will occur and attempt to design systems which will continue to operate in the presence of such faults. In other words, fault-tolerant systems are designed to tolerate undesired changes in their internal structure or their external environment without resulting in system failure. Fault-tolerant systems utilize a variety of schemes to achieve this goal. Once a fault is detected, various combinations of structural and informational redundancy make it possible to mask the fault (e.g., through replication of system elements) or to correct it (e.g., by dynamic system reconfiguration or some other recovery process). By combining such fault tolerance techniques with traditional fault prevention techniques, even greater increases in overall system reliability may be realized.
SUMMARY OF THE INVENTION
According to the invention, a fault-tolerant computer architecture is provided wherein the effect of hardware faults is diminished. The architecture employs a main data bus having a plurality of interface slots for interconnecting conventional computer sub-systems. Such sub-systems may include a magnetic disk sub-system and a serial communication sub-system. The number and type of sub-systems may vary considerably; however, a central processor sub-system, which encompasses the inventive elements of the invention, is always included.
The central processor sub-system employs a plurality of central processing modules operating in parallel in a substantially synchronized manner. One of the central processing modules operates as a master central processing module, and is the only module capable of reading data from and writing data to the main data bus. The master central processing module is initially chosen arbitrarily from among the central processing modules.
Each central processing module comprises a means by which the module can compare data on the main data bus with data on a secondary bus within each module in order to determine if there is an inconsistency indicating a hardware fault. If such an inconsistency is detected, each module generates state outputs which reflect the probability that a particular module is the source of the fault. A synchronization bus which is separate from the main data bus interconnects the central processing modules and transmits the state outputs from each module to every other central processing module.
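The comparison and broadcast described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the patented circuit: the names `CpuModule`, `mismatch`, and `broadcast_states` are hypothetical, and each state output is reduced to a single mismatch flag, whereas the text describes outputs that reflect how likely a module is to be the source of the fault.

```python
from dataclasses import dataclass


@dataclass
class CpuModule:
    # Hypothetical software stand-in for one central processing module.
    name: str
    local_bus_value: int  # value currently on the module's secondary bus


def mismatch(module: CpuModule, main_bus_value: int) -> bool:
    """Return True when the module's secondary bus disagrees with the main bus."""
    return module.local_bus_value != main_bus_value


def broadcast_states(modules, main_bus_value):
    """Model the synchronization bus: collect every module's flag so that
    each module can see the flags of all the others."""
    return {m.name: mismatch(m, main_bus_value) for m in modules}


modules = [CpuModule("A", 0x5A), CpuModule("B", 0x5A), CpuModule("C", 0x5B)]
states = broadcast_states(modules, main_bus_value=0x5A)
```

In this model, module C holds a value that differs from the main bus, so only its flag is raised. The flags travel on a path separate from the data being checked, mirroring the separate synchronization bus.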
More specifically, each central processing module comprises a shared data bus connected to the main data bus through a first bus interface. A number of hardware elements are connected to the shared data bus including a read/write memory, an asynchronous receiver/transmitter circuit, a timer circuit, a plurality of control and status registers, and a special purpose read/write memory, the purpose of which is to store data corresponding to main data bus interface slots having a defective or absent computer sub-system.
Each module further comprises a comparator circuit which is part of the first bus interface, the purpose of which is to compare data on the main data bus with data on the shared data bus and generate state outputs in response thereto. A parity checking circuit is also part of the first bus interface and monitors data lines in the main data bus, generating a parity output which is used as an input to the comparator circuit.
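As a rough illustration of how a parity output might feed the comparison, the sketch below models both checks in software. Several details are assumptions that the text does not state: even parity, a single parity line, and the function names `parity_bit`, `parity_error`, and `comparator`.

```python
def parity_bit(word: int) -> int:
    """Even-parity bit for a non-negative word: 1 when the word has an
    odd number of 1 bits, so that data plus parity has an even count."""
    return bin(word).count("1") & 1


def parity_error(data: int, parity_line: int) -> bool:
    """True when the parity line does not match the data lines."""
    return parity_bit(data) != parity_line


def comparator(main_bus_data: int, parity_line: int, shared_bus_data: int) -> dict:
    """Combine the bus comparison with the parity result (assumed structure)."""
    return {
        "mismatch": main_bus_data != shared_bus_data,
        "parity_error": parity_error(main_bus_data, parity_line),
    }


clean = comparator(0b1011, 1, 0b1011)
faulty = comparator(0b1011, 0, 0b1010)
```

Reporting the parity result separately from the bus mismatch is a choice made for this sketch. It lets a later stage tell a corrupted main bus transfer apart from a disagreement between modules.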
A private data bus is connected to the shared data bus through a second bus interface. The private data bus is also connected to a plurality of hardware elements which may include a read/write memory, a read-only memory, and a “dirty” memory. The purpose of the “dirty” memory is to store data corresponding to memory locations in the read/write memory to which information has been written. As will become clear, this facilitates the copying of data from one central processing module to another. Also connected to the private data bus and controlling the operation of each central processing module is a central processing unit which operates in a substantially synchronized manner with central processing units in other central processing modules.
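The role of the "dirty" memory can be sketched as a write-tracking structure that limits how much data must be copied between modules. This is a minimal model, assuming tracking at a fixed page granularity and two memories of equal size. `TrackedMemory`, `page_size`, and `copy_dirty_to` are hypothetical names, not taken from the patent.

```python
class TrackedMemory:
    """Read/write memory that records which pages have been written."""

    def __init__(self, size: int, page_size: int = 4):
        self.cells = [0] * size
        self.page_size = page_size  # assumed tracking granularity
        self.dirty_pages = set()

    def write(self, address: int, value: int) -> None:
        self.cells[address] = value
        self.dirty_pages.add(address // self.page_size)

    def copy_dirty_to(self, other: "TrackedMemory") -> int:
        """Copy only the written pages to another memory of the same size,
        clear the dirty record, and return the number of pages copied."""
        copied = 0
        for page in sorted(self.dirty_pages):
            start = page * self.page_size
            end = start + self.page_size
            other.cells[start:end] = self.cells[start:end]
            copied += 1
        self.dirty_pages.clear()
        return copied


source = TrackedMemory(16)
target = TrackedMemory(16)
source.write(5, 42)
source.write(6, 7)
source.write(13, 9)
pages_copied = source.copy_dirty_to(target)
```

Here three writes touch only two of the four pages, so two pages are copied instead of the whole memory.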
Finally, each central processing module contains a control logic circuit which is connected to and controls the first and second bus interfaces. The control logic circuit receives as its inputs the state outputs generated by the comparator circuits in every central processing module. The circuit, using these and other control signals described more specifically below, generates, among other things, control logic signals which indicate to the central processing unit whether a fault has occurred. If a fault is detected, each module then executes a routine which identifies the location of the fault, disables the failed module or sub-system, and then returns to the instruction being executed at the time the fault was detected.
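The detect, identify, disable, and resume sequence can be outlined as follows. The summary does not say how the routine locates the fault, so this sketch assumes, purely for illustration, that each state output is a numeric suspicion score and that the module with the highest score is disabled. `handle_fault` and its parameters are hypothetical.

```python
def handle_fault(state_outputs: dict, enabled: set, program_counter: int):
    """Return the modules left enabled and the address at which to resume.

    state_outputs maps a module name to a suspicion score, where 0 means
    that no fault is indicated for that module (an assumed encoding).
    """
    if not any(state_outputs.values()):
        # No fault signalled: nothing is disabled and execution continues.
        return set(enabled), program_counter
    # Assumed identification rule: the highest score marks the failed module.
    suspect = max(state_outputs, key=state_outputs.get)
    remaining = set(enabled) - {suspect}
    # Resume at the instruction that was executing when the fault appeared.
    return remaining, program_counter


remaining, resume_at = handle_fault(
    {"A": 0, "B": 3, "C": 1}, {"A", "B", "C"}, program_counter=0x100
)
```

The program counter is returned unchanged to reflect the statement that each module returns to the instruction being executed at the time the fault was detected.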
Embodiments of this invention will now be described by way of examples only and with reference to the accompanying drawings.


