Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability
Reexamination Certificate
1997-10-02
2001-05-22
Peikari, B. James (Department: 2186)
C714S006130
active
06237108
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a multiprocessor system having a redundant shared memory configuration, which includes a plurality of processors and also a plurality of shared system memories that can be used in common by the plurality of processors and which allows the shared system memories to have redundancy in writing data in these memories.
More specifically, the present invention relates to a multiprocessor system which ensures that the contents of each pair of shared system memories are equivalent to each other, for example, in the case where the same data is written in each pair of shared system memories having a dual shared memory configuration.
2. Description of the Related Art
In recent years, it has become necessary to process a relatively large amount of data at high speed and with high reliability, especially in the field of data communication systems using computer systems. To satisfy this requirement, a multiprocessor system has been developed, which is constituted by a plurality of processors each including a central processing unit (usually abbreviated to CPU). By effectively utilizing a plurality of central processing units, such a multiprocessor system has a data processing capability much higher than that of a single processor.
Further, in the above-mentioned multiprocessor system having a plurality of processors, even if a certain processor has failed during operation, any other processor can continue to process the data in place of the failing one. Namely, the above-mentioned multiprocessor system has a redundant configuration in regard to the processors, which can provide a fault tolerant computer system.
Further, to make such a fault tolerant computer system more complete and to ensure the data integrity of the whole multiprocessor system, it appears indispensable that the shared system memories provided for supporting high-speed data processing also have a redundant shared memory configuration, e.g., a dual memory configuration.
More specifically, in regard to these dual shared system memories, it is required that the data stored in one memory of each pair be equivalent to the data stored in the other, so as to ensure conformity of the respective data, especially when the same data is to be written in each pair of dual shared system memories.
However, in general, conformity of the respective data may fail to be ensured mainly in the following three cases. Here, to simplify the explanation of these situations, it is assumed that a multiprocessor system having a number of processor modules includes only one pair of dual shared system memory modules.
(1) The first is the case in which, when a write access to the dual shared system memory modules is carried out by a given processor module, the write operation in one module terminates normally while the write operation in the other module terminates abnormally. Namely, the data has not yet been completely written in the other module of the dual shared system memory modules.
However, in this case, the above-mentioned processor module by which the write access was carried out still continues to operate. By means of an abnormal termination message, the processor module can recognize the specified address to which the write access has failed, and can therefore reliably rewrite the data corresponding to that address by executing a data recovery process. Consequently, it can finally be ensured that the data written in one of the dual shared system memory modules is equivalent to the data reconstructed by the data recovery process in the other one, and the problem concerning the above-mentioned first case does not become so serious in practice.
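The recovery path of the first case can be sketched as follows. This is only a minimal illustration under stated assumptions, not the hardware described here: `MemoryModule`, `dual_write`, the `fail_once_at` fault injection, and the retry limit are all hypothetical names introduced for the example, and a Python dict stands in for the memory contents.

```python
class MemoryModule:
    """Hypothetical stand-in for one shared system memory module."""

    def __init__(self, name, fail_once_at=None):
        self.name = name
        self.cells = {}
        # Address at which the next write terminates abnormally (fault injection).
        self.fail_once_at = fail_once_at

    def write(self, address, value):
        """Return True on normal termination, False on abnormal termination."""
        if address == self.fail_once_at:
            self.fail_once_at = None  # the fault is transient in this sketch
            return False
        self.cells[address] = value
        return True


def dual_write(modules, address, value, max_retries=1):
    """Write the same data to every module, rewriting any address that failed."""
    for module in modules:
        ok = module.write(address, value)
        retries = 0
        # The abnormal termination identifies the failed address, so the
        # still-running processor can rewrite exactly that address.
        while not ok and retries < max_retries:
            ok = module.write(address, value)
            retries += 1
        if not ok:
            raise RuntimeError(f"{module.name}: write to {address:#x} failed")


mem_a = MemoryModule("A")
mem_b = MemoryModule("B", fail_once_at=0x10)
dual_write([mem_a, mem_b], 0x10, "payload")
consistent = mem_a.cells == mem_b.cells
```

Because the writing processor is still running and knows which address failed, only that one address needs to be rewritten; no full copy of the memory is required in this case.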
(2) The second is the case in which at least one of the dual shared system memory modules determines that it is impossible to continue normal operation due to a contradiction that has occurred within the shared system memory module itself. In this case, since the shared system memory module can no longer reliably preserve the data stored therein, the memory module stops operating from that time on and assumes a state of “HALT” (hereinafter, a state of “HALT” will be simply referred to as HALT).
Here, the contradiction in the shared system memory module itself means a logical contradiction which generally occurs when the hardware of the shared system memory module goes out of control. More concretely, examples of this type of contradiction include an abnormality of a sequencer in a system bus controller, which is a connecting unit to a system bus and which will be described hereinafter, and an abnormality of another sequencer in a memory controller in the shared system memory module.
In this case, the data that was stored in the shared system memory module assuming HALT is not reliable at all. Accordingly, to reliably carry out a data recovery process for a shared system memory module assuming HALT, it is necessary to copy all the content of the other shared system memory module, which is in a normal state, to the shared system memory module assuming HALT. Such a copy process is usually executed after the shared system memory module assuming HALT has been brought into a state in which it can operate normally.
For example, in the case where the shared system memory module assumes HALT due to a recoverable, temporary trouble caused by a software-type error, the normal state can be restored by resetting the memory module and canceling the state of HALT. Conversely, in the case where the shared system memory module assumes HALT due to a serious, permanent trouble that is caused by a hardware-type error and is usually difficult to remedy, the normal state can be restored only by replacing the memory module assuming HALT with a new memory module.
Generally, in carrying out the above-mentioned copy process of all the content of the normal shared system memory module, the larger the storage capacity of the shared system memory module, the longer it takes to complete the copy process. Therefore, the system bus of the multiprocessor system is likely to be occupied by the copy accesses of the processor module executing the copy process. Further, in the case where another processor module carries out a write access to the shared system memory module in which the copy process is being executed, the copy access command from the copying processor module is likely to contend with the write access command from the other processor module. As a result of such contention, a disadvantage may occur in that, when the copy process is completed, the data stored in the shared system memory module by the copy process is not always equivalent to that of the other, normal shared system memory module.
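One way such contention can leave the two modules unequal is sketched below. The text does not specify the exact interleaving, so this example assumes one: the copy reads a value from the normal module, another processor then writes new data to that same address in both modules, and the copy finally stores the stale value it had already read. The function name `copy_with_interleaved_write` and its parameters are hypothetical, and dicts stand in for the memory modules.

```python
def copy_with_interleaved_write(source, target, write_at_step, address, value):
    """Copy source to target one address at a time.

    A second processor's write is applied between the copy's read and the
    copy's write at step `write_at_step` (an assumed interleaving).
    Returns True if both modules hold the same data when the copy completes.
    """
    for step, addr in enumerate(sorted(source)):
        staged = source[addr]  # copy access: read from the normal module
        if step == write_at_step:
            # Contending write access from another processor, applied to both
            # modules while the copy still holds the previously read value.
            source[address] = value
            target[address] = value
        target[addr] = staged  # copy access: write to the recovering module
    return source == target


normal = {0: "a", 1: "b", 2: "c"}
recovering = {}
# The write hits address 1 exactly while address 1 is being copied.
equivalent = copy_with_interleaved_write(normal, recovering, 1, 1, "B")
```

In this run the recovering module ends with the old value at address 1 while the normal module holds the new one, so `equivalent` is false. A write that lands on an address the copy has not yet read does not cause a mismatch in this model, which is why the problem only appears under particular timing.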
However, in almost every instance of the above-mentioned second case, one of the dual shared system memory modules stops operating and assumes HALT due to a trouble caused by some hardware-type error. In such a case, it becomes necessary in practice to replace the shared system memory module in a state of HALT with a new system memory module, and then to copy all the data of the other one of the dual shared memory modules to the new system memory module. Namely, to deal with a shared system memory module in a state of HALT, troublesome work such as the replacement of the abnormal memory module is unavoidable.
Fortunately, it should be noted that the probability that a shared system memory module itself assumes HALT due to some hardware-type error is extremely low, and that the trouble concerning the above-mentioned second case does not become so serious in practice.
(3) The third is the case
Kabemoto Akira
Ogawa Toshio
Fujitsu Limited
Peikari B. James
Staas & Halsey, LLP