Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability
Reexamination Certificate
1999-11-04
2004-08-24
Beausoliel, Robert (Department: 2113)
C714S004110, C714S005110, C714S048000
active
06782492
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to an error recovery technology in a parallel computer, and in particular to a memory error recovery technology in a cluster computer.
2. Description of the Related Art
One type of parallel computer is a cluster computer, in which a plurality of nodes that include at least one processor and a memory are connected together by a high-speed interconnecting network, such as a crossbar network. One of the advantages of a cluster computer is that the ratio of cost to capacity is superior. For example, while there is added cost for each node, when using a workstation with a high throughput, a ratio of cost to capacity that is almost as great as that of a super-computer can be obtained. In addition, another advantage is that the system is easily expanded in comparison to a parallel computer having central common memory wherein common memory is centrally allocated at one physical location. Furthermore, another advantage is that because each node is independent as one computer under the control of its proper operating system, it is possible to obtain a multi-job processing configuration, for example, executing a different job at different nodes that configure the cluster computer, or executing one job at a plurality of nodes simultaneously as a parallel program. Moreover, Japanese Unexamined Patent Application, First Publication, No. Hei 8-305677, is an example of a citation relating to such a cluster computer.
In addition, there are cluster computers that are distributed common memory parallel computers that allocate local memory to each node, and do not centrally allocate common memory to one physical location. However, because this is one type of common memory computer, the inter-processor communication model follows the common memory model. That is, communication between nodes is realized by the processor of each node directly accessing the common memory by an address command using a conventional memory access operation. Specifically, when a memory access request generated at one node is an access to the memory located at the same node, the memory access request is transferred to the memory of the same node, and the memory access origin is sent the access result. Otherwise, when a memory access request generated at one node is an access to memory located on another node, the memory in the other node is accessed by the memory access request being transferred to the other node through the interconnecting network, the access result being returned to the request origin node through the interconnecting network, and the memory access origin being notified.
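The local and remote access paths described above can be illustrated with a minimal sketch. The names Cluster, Node, home_node, store and read are hypothetical, and the equal-sized contiguous address ranges per node are an assumption made only for illustration; the specification does not define an address mapping.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    # Local memory of this node only: global address -> stored value.
    memory: dict = field(default_factory=dict)


class Cluster:
    """Toy stand-in for nodes joined by an interconnecting network."""

    def __init__(self, node_count, words_per_node):
        self.words_per_node = words_per_node
        self.nodes = [Node(i) for i in range(node_count)]

    def home_node(self, address):
        # Assumption: the global address space is split into equal,
        # contiguous ranges, one range per node.
        return address // self.words_per_node

    def store(self, address, value):
        # Helper that places a value in the memory of the owning node.
        self.nodes[self.home_node(address)].memory[address] = value

    def read(self, origin_id, address):
        """Return (value, path); path records how the request travelled."""
        home = self.home_node(address)
        if home == origin_id:
            # Access to memory on the same node: no network transfer.
            path = ["local memory"]
        else:
            # Access to memory on another node: the request goes out over
            # the interconnecting network and the result comes back.
            path = [
                "network to node %d" % home,
                "remote memory",
                "network to node %d" % origin_id,
            ]
        value = self.nodes[home].memory.get(address, 0)
        return value, path


cluster = Cluster(node_count=2, words_per_node=1024)
cluster.store(1500, 7)                       # address 1500 belongs to node 1
remote_value, remote_path = cluster.read(0, 1500)
local_value, local_path = cluster.read(1, 1500)
```

In this sketch the request origin receives the same value on either path; only the route taken by the request and its result differs.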
The memory within the nodes that configure the cluster computer stores important information that cannot be damaged, such as the operating system and other types of application programs that are executed by the node. Thus, memory that has an internal ECC (Error Checking and Correction) function is used to increase reliability. For example, a 1 bit error can be corrected with a Hamming code that adds 7 correction bits to 32 bits.
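As an illustration of the 7-bit example, the following is a minimal sketch of an extended Hamming code over 32 data bits. It assumes the common arrangement of 6 Hamming parity bits plus 1 overall parity bit, which corrects a 1 bit error and detects a 2 bit error; the specification does not state which bit layout the memory actually uses, and the function names encode and decode are hypothetical.

```python
PARITY_POSITIONS = [1, 2, 4, 8, 16, 32]   # Hamming parity bits (6 of them)
CODE_LEN = 38                             # positions 1..38; index 0 = overall parity
CHECK_BITS = len(PARITY_POSITIONS) + 1    # 6 + 1 overall parity bit = 7


def encode(data):
    """Encode a 32-bit integer into a list of 39 bits (32 data + 7 check)."""
    word = [0] * (CODE_LEN + 1)
    bit = 0
    for pos in range(1, CODE_LEN + 1):
        if pos not in PARITY_POSITIONS:
            word[pos] = (data >> bit) & 1
            bit += 1
    for p in PARITY_POSITIONS:
        parity = 0
        for pos in range(1, CODE_LEN + 1):
            if pos & p:
                parity ^= word[pos]
        word[p] = parity                  # makes the parity of group p even
    word[0] = sum(word[1:]) % 2           # overall parity over positions 1..38
    return word


def decode(word):
    """Return (data, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    word = list(word)
    syndrome = 0
    for pos in range(1, CODE_LEN + 1):
        if word[pos]:
            syndrome ^= pos               # XOR of the positions of all set bits
    overall_odd = sum(word) % 2 == 1
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd:
        # Odd number of flipped bits: treated as a single-bit error.
        if syndrome > CODE_LEN:
            return None, "uncorrectable"
        word[syndrome] ^= 1               # syndrome 0 means the overall parity bit
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: a 2 bit error.
        return None, "uncorrectable"
    data = 0
    bit = 0
    for pos in range(1, CODE_LEN + 1):
        if pos not in PARITY_POSITIONS:
            data |= word[pos] << bit
            bit += 1
    return data, status


codeword = encode(0xDEADBEEF)
one_flip = list(codeword)
one_flip[17] ^= 1                         # inject a 1 bit error
two_flips = list(one_flip)
two_flips[30] ^= 1                        # inject a second error
```

Decoding one_flip recovers the original word with status 'corrected', while decoding two_flips reports 'uncorrectable', which corresponds to the irrecoverable case discussed in this specification.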
When a node carries out a memory access of memory with this kind of internal error correction function, a 1 bit error will be automatically corrected, and the memory access ends normally. However, if there is a 2 bit error, the memory access ends abnormally because the error is not correctable, and an irrecoverable abnormal stop is returned as the memory access result. Because a hardware fault that results in an irrecoverable error in the main memory of the computer constitutes a very serious error, in conventional cluster computers, as in general use computers, a system shutdown error notification is issued in the node that receives an irrecoverable abnormal stop as a memory access result, all programs being executed at that node are ended, and the system is stopped.
Therefore, when an irrecoverable error is generated in a common communication area that is located in the memory of each node for communication between nodes, the node that accessed this common communication area causes a system shutdown even if the memory being accessed is located on another node. Because an original feature of cluster computers is that each node is able to operate independently, a node that shuts down the system merely because it accessed a location in memory on another node where an irrecoverable error was produced becomes a factor in severely decreasing the availability of the cluster computer.
Thus, it is an object of the present invention to stop a node that has accessed the common communication area from shutting down the system due to an irrecoverable error produced in the common communication area of memory located on another node, and increase the availability of a cluster computer.
In addition, if a node continues operating although an irrecoverable error has occurred in the node's privileged memory area, which stores information necessary for operation such as the kernel of the operating system, this node will inevitably shut down the system. However, when the irrecoverable error occurs in a common communication area located on that node, the node also immediately shuts down the system, and this situation is a major factor in decreasing the availability of the cluster computer.
Thus, a second object of the present invention is to prevent one node from shutting down the system due to an irrecoverable error occurring in the common communication area of the memory located on that same node, and increase the availability of the cluster computer.
SUMMARY OF THE INVENTION
In order to achieve the first object of the present invention, each node in the cluster computer of the present invention sends to the memory access origin a system error stop notification when an irrecoverable error occurs at the time a memory access request in one node is sent to that same node's privileged area, and sends to the memory access origin a common communication area error notification when the irrecoverable error occurs at the time a memory access request is sent from one node to a common communication area of memory located on another node through the interconnection network.
In this manner, in addition to the conventional system error stop notification, which indicates that the system will be immediately stopped because a fatal error has occurred and is sent as a notice that an irrecoverable error has occurred during a memory access, a common communication area error notification is defined that indicates that a minor error not connected with a system stop has occurred. In case an irrecoverable error occurs during a memory access request generated in a node, if the access destination is the same node's privileged area, a system error stop notification indicating that a severe error has occurred is generated. However, if the access destination is the common communication area of the memory located on another node, a common communication area error notification indicating that a minor error has occurred is sent rather than the system error stop notification. Thereby, it is possible to prevent the system being shut down by a node that accesses the common communication area of memory located on another node due to an irrecoverable error occurring in that common communication area, and it is possible to increase the availability of the cluster computer.
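The choice between the two notifications can be summarized in a minimal sketch. The names Area, Notification and classify are hypothetical and do not appear in the specification. Only the two cases named in this passage are taken from the text; the fallback for any other combination is an assumption that mirrors the conventional behavior described in the related art.

```python
from enum import Enum


class Area(Enum):
    PRIVILEGED = "privileged area"
    COMMON = "common communication area"


class Notification(Enum):
    NORMAL_END = "normal end"
    SYSTEM_ERROR_STOP = "system error stop notification"
    COMMON_AREA_ERROR = "common communication area error notification"


def classify(origin_node, target_node, area, irrecoverable):
    """Pick the notification returned to the memory access origin."""
    if not irrecoverable:
        # No error, or a 1 bit error that the ECC memory corrected itself.
        return Notification.NORMAL_END
    if target_node == origin_node and area is Area.PRIVILEGED:
        # Severe error: the node's own privileged area is damaged.
        return Notification.SYSTEM_ERROR_STOP
    if target_node != origin_node and area is Area.COMMON:
        # Minor error: only the common area on another node is damaged,
        # so the requesting node does not have to stop.
        return Notification.COMMON_AREA_ERROR
    # Assumption: combinations not covered by this passage fall back to
    # the conventional system error stop.
    return Notification.SYSTEM_ERROR_STOP


local_fault = classify(0, 0, Area.PRIVILEGED, irrecoverable=True)
remote_fault = classify(0, 1, Area.COMMON, irrecoverable=True)
```

Here local_fault is the system error stop notification and remote_fault is the common communication area error notification, so only the first case leads to a system stop at node 0.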
In the case that an irrecoverable error has occurred in the memory of another node during a memory access request sent from a given node, a common communication area error notification is ultimately sent to the memory access origin, such as the processor of the node that is the request origin, and one of the following methods is used to determine where this common communication area error notification is generated.
In one method, when a memory access request generated in one node is an access request to the memory of another node, a system control device in each node that carries out control of transferring the request to another node through the interconnection network will generate a c
Beausoliel Robert
Maskulinski Michael
McGinn & Gibb PLLC
NEC Corporation