Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability
Reexamination Certificate
1998-11-10
2001-04-10
Hua, Ly V. (Department: 2785)
C713S500000, C713S501000, C713S600000
active
06216236
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a processing unit for a computer, and to a computer system incorporating one or more such processing units together with, for example, a system bus and a main memory. The present invention also relates to a method of operating a computer system.
2. Summary of the Prior Art
The increasing use of computers in many aspects of human society has increased the need for those computers to operate reliably, and in a fault-free manner. For example, where banking or trading systems are based on computers, a temporary failure in the computer may result in significant economic loss. Furthermore, computers are increasingly used in situations where human life or health would be put at risk by failure of a system involving a computer. Therefore, it is increasingly important that computers operate in a fault-free manner, or at least that they continue to operate reliably despite the occurrence of a fault.
The most likely source of a fault in a computer system is a processor of that system. Therefore, consideration has been given to providing processor redundancy. If two processors of the same computer are arranged to carry out the same program (operation), then it is possible to detect a fault if, for any reason, those processors are not, in fact, carrying out the same program. Thus, by monitoring a pair of processors, it is possible to detect a fault in that pair.
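The comparison principle described above can be illustrated with a minimal Python sketch. It is an illustration only and does not represent the circuitry of any cited system: the names `run_checked_pair`, `healthy` and `stuck_bit`, and the stuck-bit fault model, are assumptions introduced here.

```python
from typing import Callable, Tuple


def run_checked_pair(cpu_a: Callable[[int], int],
                     cpu_b: Callable[[int], int],
                     value: int) -> Tuple[int, bool]:
    """Run the same operation on two simulated processors and compare.

    Returns (result of processor A, fault flag). The flag is True when
    the two results differ.
    """
    result_a = cpu_a(value)
    result_b = cpu_b(value)
    return result_a, result_a != result_b


def healthy(x: int) -> int:
    """The intended operation."""
    return x * 2 + 1


def stuck_bit(x: int) -> int:
    """Hypothetical faulty processor: bit 2 of the result is stuck at 1."""
    return healthy(x) | 0x4


ok = run_checked_pair(healthy, healthy, 5)
bad = run_checked_pair(healthy, stuck_bit, 5)
print(ok, bad)
```

Note that the comparison reveals only that the pair disagrees. It does not identify which of the two processors is at fault, which is why the pair as a whole is treated as the faulty unit.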
If the pair is mounted on a common board, detection of a fault in the pair can be used to trigger a signal to replace that board. However, if such an arrangement is used, either the computer must shut down when one board fails, or there must be some arrangement for continuing operation.
It should be noted that, throughout the present specification, the term “board” or “support board” indicates a single indivisible support for one or more processors and other associated circuitry. Such a board may be a printed circuit board, or may be a ceramic board. Furthermore, it is possible to envisage a computer in which the processor or processors and associated circuitry are integrated in a single semiconductor element (chip). Of course, in a computer, a plurality of such boards may be interconnected by a suitable board mounting system, but the term “board” in the present specification is not intended to denote the composite result of such a mounting system.
In U.S. Pat. No. 4,654,857, pairs of processors were mounted on respective boards, and connected to a common system bus or system buses. The pair of processors of each board were arranged to carry out the same program (operation), and the board had suitable means for detecting if the program, or the result of the program, was different between the two processors of the pair, this corresponding to a detection of a fault. Furthermore, U.S. Pat. No. 4,654,857 proposed that the same operation was carried out simultaneously in the processors of two boards. Therefore, if a fault occurred in the processors of one board, that board could be withdrawn from operation without the operation having to be stopped, since the operation could continue in the other board. Therefore, the faulty board could be replaced, with the operation then being duplicated in the new board and in the remaining board of the old pair. This arrangement was known as a “Pair and Spare” system.
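As a rough sketch of the “Pair and Spare” idea, the Python fragment below models two boards, each with a self-checking processor pair, running the same operation. The class `Board`, the function `pair_and_spare_step` and the sample operations are hypothetical names chosen for illustration. They are not taken from U.S. Pat. No. 4,654,857.

```python
from typing import Callable, List, Tuple


class Board:
    """A board carrying two processors that run the same operation."""

    def __init__(self, name: str,
                 cpu_a: Callable[[int], int],
                 cpu_b: Callable[[int], int]) -> None:
        self.name = name
        self.cpu_a = cpu_a
        self.cpu_b = cpu_b

    def execute(self, value: int) -> Tuple[int, bool]:
        """Return (result, self_check_ok) for one operation."""
        a = self.cpu_a(value)
        b = self.cpu_b(value)
        return a, a == b


def pair_and_spare_step(boards: List[Board],
                        value: int) -> Tuple[int, List[str]]:
    """Run one operation on every board.

    Returns the result from the first board whose self-check passed,
    plus the names of boards to withdraw from operation.
    """
    results: List[int] = []
    withdrawn: List[str] = []
    for board in boards:
        result, ok = board.execute(value)
        if ok:
            results.append(result)
        else:
            withdrawn.append(board.name)
    if not results:
        raise RuntimeError("all boards reported a fault")
    return results[0], withdrawn


def add_seven(x: int) -> int:
    return x + 7


def off_by_one(x: int) -> int:
    """Hypothetical faulty processor."""
    return x + 8


boards = [Board("board0", add_seven, off_by_one),
          Board("board1", add_seven, add_seven)]
result, withdrawn = pair_and_spare_step(boards, 3)
print(result, withdrawn)
```

In this sketch the operation completes using the healthy board while the faulty board is flagged for replacement, so no restart is needed.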
An alternative was disclosed in JP-A-59-160899, in which, again, pairs of processors were mounted on respective boards. The same operation was then carried out by the two processors on any given board. Furthermore, the system was operated so that any board carrying out a particular operation (program) always and repeatedly transferred information about that operation to the main storage memory of the computer via the or each system bus. Then, if it was detected that the processors were not carrying out the same operation, which then corresponded to a fault, the information in the main storage memory was immediately transferred to another board which was not then in use. That new board could then continue the operation and the faulty board could then be replaced. Thus, JP-A-59-160899 proposed a software solution.
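The software approach described above amounts to checkpointing followed by failover. The sketch below assumes a simple running-sum operation and a dictionary standing in for the main storage memory. The function `run_on_board` and the `fault_at` parameter are illustrative inventions, not details from JP-A-59-160899.

```python
from typing import Dict, List, Optional


def run_on_board(items: List[int],
                 memory: Dict[str, int],
                 fault_at: Optional[int] = None) -> bool:
    """Accumulate `items`, checkpointing progress to `memory` each step.

    Execution resumes from whatever state `memory` already holds.
    Returns True if the operation finished, or False if a (simulated)
    processor mismatch stopped the board before step `fault_at`.
    """
    index = memory["next_index"]
    total = memory["total"]
    while index < len(items):
        if fault_at is not None and index == fault_at:
            return False  # fault detected; nothing further is committed
        total += items[index]
        index += 1
        # Transfer the current state to main storage memory.
        memory["next_index"] = index
        memory["total"] = total
    return True


items = [4, 8, 15, 16, 23, 42]
main_memory = {"next_index": 0, "total": 0}

# The first board fails when it reaches step 3.
finished = run_on_board(items, main_memory, fault_at=3)
snapshot = dict(main_memory)

# A standby board picks up the saved state and continues.
if not finished:
    finished = run_on_board(items, main_memory)

print(snapshot, main_memory, finished)
```

The cost of this scheme is visible in the loop: every step writes state back to memory, which in a real system means traffic on the system bus.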
A hardware solution was proposed in JP-A-1-258057, in which a single processor was mounted on each respective board, and the outputs of three boards passed via a voting unit to the system bus or system buses. The processor of each board for a group of three boards was arranged to carry out the same operation, so that the voting unit would normally receive three identical outputs from the three boards. If the output of any one board differed from the other two, that one board could be then declared faulty and the operation continued on the basis of two boards. The faulty board could then be replaced.
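The voting step can be sketched in Python as a simple majority vote. This is a software illustration of the principle only. The function name `vote` and its return convention are assumptions, and the voting unit described above is hardware.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def vote(outputs: Sequence[int]) -> Tuple[int, List[int]]:
    """Majority-vote over board outputs.

    Returns (majority value, indices of boards that disagreed).
    Raises ValueError when no value is held by a strict majority.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among board outputs")
    faulty = [i for i, out in enumerate(outputs) if out != value]
    return value, faulty


all_agree = vote([7, 7, 7])
one_faulty = vote([7, 9, 7])
print(all_agree, one_faulty)
```

Unlike a comparison between two processors, a vote among three identifies which board disagrees, so the remaining two can continue without interruption.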
The above description of prior art has considered proposals for preventing failure of the computer due to a fault in a processor of the computer. Another possible source of fault is in the clock of the computer. Where processors are mounted on a common board, that board may have its own clock which generates clock pulses to the processors for synchronizing their operation. In such an arrangement, it is apparent that a failure in the clock would result in total failure of the board. The situation is made worse if the computer has a common clock, since then failure of that clock would result in failure of the whole computer.
Therefore, the article “Aircraft Highly Reliable Fault Tolerant Multiprocessor” by A. L. Hopkins Jr. et al. in Proceedings of the IEEE, Vol. 66, No. 10, pages 1221 to 1239 (October 1978) proposed a computer in which there was more than one clock, and the phases of the clock pulses were matched using a phase locked loop arrangement. Thus, failure of one clock did not result in failure of the whole computer.
A third possible source of fault is in a cache memory of the computer.
The use of cache memories, particularly in multiprocessor arrangements, has the advantage of reducing the effective memory access time. Cache memories using a copy back mode have come into use recently. In the conventional write through method, the main storage memory is updated on every write. In the copy back mode, by contrast, a write stores data only in the cache memory, so as to minimize the load on the system bus or buses. However, with the copy back mode the most recent data may be held only in the cache memory, which causes the problem of maintaining data reliability when a cache memory is faulty. A possible method for solving this problem is to add an error correction code to each cache memory. However, generating and checking an error correction code takes time, which increases the cache memory access time.
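The difference between the two write policies can be made concrete with a toy model. The `Cache` class below, its `bus_writes` counter and the `flush` method are hypothetical constructs for illustration and do not model any particular hardware.

```python
from typing import Dict, Set


class Cache:
    """Toy cache supporting write through or copy back."""

    def __init__(self, memory: Dict[int, int], copy_back: bool) -> None:
        self.memory = memory
        self.copy_back = copy_back
        self.lines: Dict[int, int] = {}
        self.dirty: Set[int] = set()
        self.bus_writes = 0  # number of writes sent over the system bus

    def write(self, addr: int, value: int) -> None:
        self.lines[addr] = value
        if self.copy_back:
            self.dirty.add(addr)       # memory is now stale
        else:
            self.memory[addr] = value  # write through: update memory now
            self.bus_writes += 1

    def flush(self) -> None:
        """Copy every dirty line back to main memory."""
        for addr in sorted(self.dirty):
            self.memory[addr] = self.lines[addr]
            self.bus_writes += 1
        self.dirty.clear()


mem_wt = {0x10: 0}
mem_cb = {0x10: 0}
wt = Cache(mem_wt, copy_back=False)
cb = Cache(mem_cb, copy_back=True)

for value in (1, 2, 3):
    wt.write(0x10, value)
    cb.write(0x10, value)

stale_value = mem_cb[0x10]       # main memory before the copy back
bus_writes_before = cb.bus_writes
cb.flush()
print(wt.bus_writes, bus_writes_before, stale_value, mem_cb[0x10])
```

Before `flush` is called, the copy back cache holds the only up-to-date copy of address `0x10`. If the cache failed at that moment, the value would be lost, which is the reliability problem described above.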
A further problem is that, if a plurality of cache memories hold the same data and a processor updates that data, it is necessary to inform the other processors of the update. The procedure for this is called a cache memory coherence protocol. Several such protocols are described in the article entitled “Cache Coherence Protocols: Evaluation Using a Multiprocessor Simulation Model” by James Archibald and Jean-Loup Baer in ACM Transactions on Computer Systems, Vol. 4, No. 4, November 1986, pp. 273 to 298. When these protocols update data, they output (broadcast) the update information to the bus, and the other cache memories fetch (“snoop”) this output and update or erase (invalidate) their own data. The Illinois protocol (University of Illinois, USA) is proposed in “A Low Overhead Coherence Solution for Multiprocessors with Private Cache Memories” by Papamarcos, M. and Patel, J., in the Proceedings of the 11th International Symposium on Computer Architecture, 1984, pp. 348-354. This protocol is highly effective in minimizing the load on the system bus or buses. These protocols consider only the relationship between a plurality of processors and memories; access by other bus users (for example, input/output units) is not taken into account.
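The broadcast-and-snoop mechanism can be sketched as follows. For brevity this sketch uses a write through cache with invalidation, which is a deliberate simplification and is not the Illinois protocol or any other protocol from the cited articles. The `Bus` and `SnoopingCache` classes are hypothetical.

```python
from typing import Dict, List


class Bus:
    """Shared bus that relays write notifications to all other caches."""

    def __init__(self) -> None:
        self.caches: List["SnoopingCache"] = []

    def broadcast_write(self, source: "SnoopingCache", addr: int) -> None:
        for cache in self.caches:
            if cache is not source:
                cache.snoop_invalidate(addr)


class SnoopingCache:
    """Write through cache that invalidates on snooped writes."""

    def __init__(self, bus: Bus, memory: Dict[int, int]) -> None:
        self.bus = bus
        self.memory = memory
        self.lines: Dict[int, int] = {}
        bus.caches.append(self)

    def read(self, addr: int) -> int:
        if addr not in self.lines:       # miss: fetch from main memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr: int, value: int) -> None:
        self.lines[addr] = value
        self.memory[addr] = value        # write through (simplification)
        self.bus.broadcast_write(self, addr)

    def snoop_invalidate(self, addr: int) -> None:
        self.lines.pop(addr, None)       # erase the stale copy, if any


memory = {0: 5}
bus = Bus()
cache0 = SnoopingCache(bus, memory)
cache1 = SnoopingCache(bus, memory)

first_reads = (cache0.read(0), cache1.read(0))  # both caches now hold addr 0
cache0.write(0, 9)                              # cache1 snoops and invalidates
invalidated = 0 not in cache1.lines
reread = cache1.read(0)                         # miss, refetched from memory
print(first_reads, invalidated, reread)
```

Only the caches attached to the bus observe the broadcast in this model. A device that writes to memory without going through `SnoopingCache.write`, such as an input/output unit, would leave stale copies behind, which corresponds to the limitation noted above.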
Furthermore, recent microprocessors generally contain internal cache memories because of improved integratio
Araoka Manabu
Fukumaru Hiroaki
Iijima Saburou
Kanekawa Nobuyasu
Kanekawa Shinichiro
Antonelli, Terry Stout & Kraus, LLP.
Hua Ly V.
Tokyo, Japan