Electrical computers and digital processing systems: multicomput – Distributed data processing
Reexamination Certificate
1998-05-11
2001-02-20
Dinh, Dung C. (Department: 2783)
C709S241000, C712S028000
active
06192391
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a process stop method and apparatus. In particular, the present invention relates to a method and apparatus for performing a process stop during checkpoint processing executed in a distributed memory system that includes a plurality of nodes interconnected in a network, each of which has at least one thread for parallel processing.
2. Description of the Related Art
Japanese Unexamined Patent Publication No. 8-263317 shows a checkpoint/restart processing system for controlling the freezing order of plural processes which relate to the synchronous (or exclusive) control in the checkpoint processing.
However, this checkpoint/restart processing system is designed for a shared memory multi-processor system, not for a distributed memory multi-processor system. In a distributed memory multi-processor system, each of the processors has its own (local) memory, which is not accessible to processes in the other processors. If the checkpoint/restart processing system is applied to a distributed memory multi-processor system, plural processes in different processors may be unable to synchronize for the checkpoint processing, because a frozen process that is the counterpart of the synchronization cannot respond to synchronization requests from processes in other processors. In such a situation, the processes in the other processors continue waiting for a response from the frozen process, which they will never receive.
Japanese Unexamined Patent Publication No. 2-287858 shows a restart system for a distributed processing system. In this restart system, whenever the communication control part in a processor requests to receive/send data from/to the other processors, a program which causes the communication control part to execute such processing is saved as checkpoint data.
However, the restart system in the latter example cannot save checkpoint data at an arbitrary point in time. Further, the frequent saving of checkpoint data lowers the performance of parallel processing.
SUMMARY OF THE INVENTION
An object of the present invention is to provide a process stop method and apparatus applicable to a distributed memory multi-processor system having a plurality of nodes, in which the system performs parallel processing that requires data communication between different nodes. The method and apparatus enable stop processing in such a way as to allow the efficient collection of a checkpoint.
Another object of the present invention is to provide a process stop method and apparatus applicable to a distributed memory multi-processor system whereby, during normal operation of parallel processing that requires data communication between different nodes, a checkpoint/restart function is executed such that the performance of parallel processing is not impaired.
Still another object of this invention is to provide a process stop method and apparatus applicable to a distributed memory multi-processor system that allows the collection of a checkpoint at any desired point in time with respect to the progress of parallel processing that requires data communication between different nodes.
In the present invention, a distributed memory multi-processor system includes a plurality of nodes interconnected in a network. Each of the nodes has at least one processor and a local memory. Each of the nodes includes a management process and a parallel processing process. The management process manages the threads in its node. The threads are distributed across some or all of the nodes to carry out the parallel processing.
First, when a user inputs an external command to the node which has the smallest node number in the network, the management process in that node sends a stop request to all threads in the node and waits for the threads to stop. In response to the stop request from the management process, each of the threads stops and notifies the management process of its own stop.
When the management process in the node has received the notification from all threads in the node, it sends the stop request to the management process in the node which has the next smallest node number in the network.
If a thread is trying to communicate with another thread when it receives the stop request from the management process, the thread does not notify the management process of its own stop until it can confirm communication with the other thread.
However, if the thread cannot confirm communication with the other thread within a predetermined time, and the node number of the thread is larger than the node number of the other thread, the thread stops and notifies the management process of its own stop.
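The two rules above can be read as a small decision function. The following Python sketch is only an illustration under stated assumptions: the names `Action` and `decide_on_stop_request` and all parameters are hypothetical, and the behavior after a timeout when the thread's node number is not the larger one (here, keep waiting) is an assumption, since the text does not state it.

```python
from enum import Enum


class Action(Enum):
    STOP_AND_NOTIFY = "stop_and_notify"
    KEEP_WAITING = "keep_waiting"


def decide_on_stop_request(communicating, confirmed, waited, timeout,
                           own_node, peer_node):
    """Return what a thread does after it has received a stop request.

    communicating: the thread was trying to reach a peer thread
    confirmed:     communication with that peer has been confirmed
    waited:        time spent waiting for the confirmation so far
    timeout:       the predetermined time limit
    own_node, peer_node: node numbers of the thread and of its peer
    """
    if not communicating:
        # Nothing pending: stop and report to the management process.
        return Action.STOP_AND_NOTIFY
    if confirmed:
        # The pending communication is confirmed, so stopping is safe.
        return Action.STOP_AND_NOTIFY
    if waited >= timeout and own_node > peer_node:
        # Timed out and this thread sits on the larger-numbered node.
        return Action.STOP_AND_NOTIFY
    # Assumption: in every other case the thread keeps waiting.
    return Action.KEEP_WAITING
```

One possible reading of the node-number condition is that stop requests reach smaller-numbered nodes first, so a peer on a smaller-numbered node may already be stopped and unable to answer.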
If the management process in the node cannot find a management process in another node that has the next smallest node number and should receive the stop request, the management process notifies the management process in the node which has the smallest node number in the network that all threads in all nodes are stopped.
Lastly, when the management process in the node which has the smallest node number in the network receives the notification that all threads in all nodes are stopped, that management process causes the management processes in the other nodes to create checkpoint data.
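As a rough illustration of the overall sequence, the following Python sketch simulates one stop pass in a single process and records the order of events. It is not the patented implementation: the function name `run_stop_sequence`, the event labels, and the input format are hypothetical, stop requests are assumed to travel in ascending node-number order, every thread is assumed to stop immediately, and at least one node is assumed to exist.

```python
def run_stop_sequence(threads_per_node):
    """Simulate one stop pass and return the ordered event log.

    threads_per_node maps a node number to the thread names on that node.
    """
    log = []
    order = sorted(threads_per_node)  # assumed order: ascending node numbers
    first = order[0]                  # node that receives the external command
    for index, node in enumerate(order):
        # The management process asks each local thread to stop and
        # waits until every one of them has reported its stop.
        for thread in threads_per_node[node]:
            log.append(("thread_stopped", node, thread))
        if index + 1 < len(order):
            # Hand the stop request on to the next node in the order.
            log.append(("stop_request_forwarded", node, order[index + 1]))
        else:
            # No further node exists: report completion to the first node.
            log.append(("all_stopped_reported", node, first))
    # The first node then has the other management processes checkpoint.
    for node in order:
        if node != first:
            log.append(("checkpoint_requested", first, node))
    return log
```

For example, `run_stop_sequence({3: ["c"], 1: ["a", "b"], 2: []})` logs the stops on node 1 first, forwards the request from node 1 to node 2 and from node 2 to node 3, reports completion from node 3 back to node 1, and ends with checkpoint requests from node 1 to nodes 2 and 3.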
REFERENCES:
patent: 5802267 (1998-09-01), Shirakihara et al.
patent: 5923832 (1999-07-01), Shirakihara et al.
patent: 6026499 (2000-02-01), Shirakihara et al.
Silva et al., “Global Checkpointing for Distributed Programs,” pp. 155-162, IEEE, 1992.
Dinh Dung C.
Foley & Lardner
NEC Corporation
Nguyen Dzung C.