Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability
Reexamination Certificate
1998-10-29
2001-12-18
Ray, Gopal C. (Department: 2181)
C714S015000, C707S793000, C712S228000
active
06332200
ABSTRACT:
TECHNICAL FIELD
This invention relates, in general, to taking a checkpoint of a parallel program and, in particular, to capturing and identifying a complete and consistent set of checkpoint files for the parallel program.
BACKGROUND ART
A requirement of any robust computing environment is to be able to recover from errors, such as device hardware errors (e.g., mechanical or electrical errors) or recording media errors. In order to recover from some device or media errors, it is necessary to restart a program, either from the beginning or from some other point within the program.
To facilitate recovery of a program, especially a long running program, intermediate results of the program are taken at particular intervals. This is referred to as checkpointing the program. Checkpointing enables the program to be restarted from the last checkpoint, rather than from the beginning of the program.
When checkpointing a program, it is important to generate a complete new checkpoint file before destroying any old checkpoint file. This is to ensure that at any instant there is a valid checkpoint file from which the program can be restored. If an old checkpoint file is erased before the new checkpoint file is completed (or if the old checkpoint file is directly overwritten with the new checkpoint file), it is possible that a system failure will occur at precisely the moment when the old checkpoint file no longer exists, but the new checkpoint file is not yet valid. This causes a situation in which there is no valid checkpoint file.
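The ordering described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `replace_checkpoint`, the `.partial` suffix, and the choice of `os.fsync` and `os.replace` are all hypothetical details picked for the example.

```python
import os
from typing import Optional


def replace_checkpoint(new_path: str, data: bytes,
                       old_path: Optional[str] = None) -> None:
    """Write a complete new checkpoint before removing the old one."""
    # Hypothetical convention: a ".partial" suffix marks an unfinished file,
    # so a crash during the write never leaves a half-written file under
    # the final checkpoint name.
    partial_path = new_path + ".partial"
    with open(partial_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the bytes to storage

    # Only a fully written file is given its final name.
    os.replace(partial_path, new_path)

    # The old checkpoint is destroyed last, once a valid new one exists.
    # The path comparison guards against deleting the file just written.
    if (old_path is not None and old_path != new_path
            and os.path.exists(old_path)):
        os.remove(old_path)
```

If a failure occurs at any point before the final `os.remove`, the old file is still present, so in this sketch there is no instant at which neither checkpoint file is valid.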
When checkpointing a parallel program, there is an additional complication. The state of all the processes of the parallel program is to be saved in a consistent manner. Thus, in general, it is not sufficient to simply take a checkpoint of each of the processes individually. Instead, the processes are coordinated, so that the resulting checkpoints reflect a valid state of the parallel program, when taken as a whole.
A problem arises if any one of the processes has an inconsistent checkpoint file as compared to the others. For example, assume a parallel program has a plurality of processes and all but one of those processes completed taking a new checkpoint. If one of the processes that finished taking a checkpoint erases its old checkpoint file, then upon restart there is no complete set of consistent checkpoint files. This is because the process that erased its old file no longer has an old checkpoint file, and the process that failed does not have a new checkpoint file.
Based on the foregoing, a need exists for a capability that ensures the capture of a complete and consistent set of checkpoint files for a parallel program. A further need exists for a capability that identifies a complete and consistent set of checkpoint files for a parallel program.
SUMMARY OF THE INVENTION
The shortcomings of the prior art are overcome and additional advantages are provided through the provision of a method of identifying a complete and consistent set of checkpoint files for a parallel program. The method includes, for instance, determining, by a plurality of processes of the parallel program, a plurality of version numbers representative of a plurality of current valid checkpoint files corresponding to the plurality of processes. The method further includes selecting from the plurality of version numbers a selected version number representative of a consistent set of checkpoint files for the parallel program.
In one embodiment, the consistent set of checkpoint files includes a plurality of checkpoint files corresponding to the plurality of processes and having the selected version number.
In a further embodiment, the method includes restoring the plurality of processes using the plurality of checkpoint files having the selected version number. In one example, prior to restoring, the plurality of processes verify that they have the plurality of checkpoint files with the selected version number.
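The verification step mentioned above might look like the following sketch. The data structure is an assumption made for illustration: `valid_versions` maps a hypothetical process id to the set of version numbers for which that process holds a valid checkpoint file. The function names are likewise invented for this example.

```python
from typing import Dict, List, Set


def processes_missing_version(selected_version: int,
                              valid_versions: Dict[int, Set[int]]) -> List[int]:
    """Return ids of processes that lack a checkpoint with the selected version."""
    return sorted(pid for pid, versions in valid_versions.items()
                  if selected_version not in versions)


def can_restore(selected_version: int,
                valid_versions: Dict[int, Set[int]]) -> bool:
    """Restoring is attempted only if every process has the selected version."""
    return not processes_missing_version(selected_version, valid_versions)
```

In this sketch, a restore would proceed only when `can_restore` returns `True`; otherwise the list from `processes_missing_version` identifies which processes cannot participate.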
In one embodiment, each of the plurality of current valid checkpoint files has a corresponding name, and each name includes one of the plurality of version numbers.
In a further example, the determining of the plurality of version numbers includes, for each process of the plurality of processes, identifying one or more valid checkpoint files corresponding to each process, in which each of the one or more valid checkpoint files has a corresponding name with a version number; and selecting from the one or more valid checkpoint files a maximum version number. Thus, the plurality of version numbers includes a plurality of maximum version numbers.
Further, the selected version number is a minimum version number selected from the plurality of maximum version numbers.
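The selection rule in the preceding paragraphs (each process takes the maximum version among its valid checkpoint files, and the selected version is the minimum of those maxima) can be sketched as below. The input shape, a mapping from a hypothetical process id to the version numbers of that process's valid files, and the handling of a process with no valid file are assumptions for illustration.

```python
from typing import Dict, Iterable, Optional


def select_consistent_version(
        valid_versions: Dict[int, Iterable[int]]) -> Optional[int]:
    """Minimum, across processes, of each process's maximum valid version."""
    maxima = []
    for _pid, versions in valid_versions.items():
        versions = list(versions)
        if not versions:
            # Assumption: a process with no valid checkpoint file means
            # no consistent set can be identified.
            return None
        maxima.append(max(versions))
    return min(maxima) if maxima else None
```

For example, suppose processes 0 and 1 hold versions 4 and 5, while process 2 failed while taking version 5 and holds only version 4. The per-process maxima are 5, 5, and 4, so `select_consistent_version({0: [4, 5], 1: [4, 5], 2: [4]})` returns `4`.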
In another aspect of the present invention, a method of identifying a set of complete and consistent checkpoint files for a parallel program is provided. The method includes, for instance, selecting, by a plurality of processes of the parallel program, a plurality of current valid checkpoint files corresponding to the plurality of processes. The method further includes using the selected plurality of current valid checkpoint files to identify a consistent set of checkpoint files for the parallel program.
In another aspect of the present invention, a method of capturing a set of checkpoint files for a parallel program is provided. The method includes, for instance, providing a plurality of checkpoint files to be used during a taking of a plurality of checkpoints corresponding to a plurality of processes of the parallel program. Each of the plurality of checkpoint files has a name, which includes a version number. The method further includes taking a plurality of checkpoints using the plurality of checkpoint files.
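One way to embed a version number in a checkpoint file name is sketched below. The source states only that each name includes a version number; the specific layout `<base>.<process id>.<version>` and the helper names are hypothetical choices for this example.

```python
import re
from typing import Optional, Tuple

# Hypothetical layout: "<base>.<process id>.<version>", e.g. "ckpt.3.7".
_NAME = re.compile(r"^(?P<base>.+)\.(?P<pid>\d+)\.(?P<version>\d+)$")


def checkpoint_name(base: str, process_id: int, version: int) -> str:
    """Build a checkpoint file name that carries its version number."""
    return f"{base}.{process_id}.{version}"


def parse_checkpoint_name(name: str) -> Optional[Tuple[str, int, int]]:
    """Recover (base, process id, version), or None if the name does not fit."""
    m = _NAME.match(name)
    if m is None:
        return None
    return m.group("base"), int(m.group("pid")), int(m.group("version"))
```

Keeping the version in the name means the set of versions a process holds can be read from a directory listing, without opening the files themselves.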
In yet a further aspect of the present invention, a system of identifying a complete and consistent set of checkpoint files for a parallel program is provided. The system includes, for instance, means for determining, by a plurality of processes of the parallel program, a plurality of version numbers representative of a plurality of current valid checkpoint files corresponding to the plurality of processes; and means for selecting from the plurality of version numbers a selected version number representative of a consistent set of checkpoint files for the parallel program.
In another aspect of the present invention, a system of identifying a set of complete and consistent checkpoint files for a parallel program is provided. The system includes, for example, means for selecting, by a plurality of processes of the parallel program, a plurality of current valid checkpoint files corresponding to the plurality of processes; and means for using the selected plurality of current valid checkpoint files to identify a consistent set of checkpoint files for the parallel program.
In yet another aspect of the present invention, a system of identifying a complete and consistent set of checkpoint files for a parallel program is provided. The system includes, for instance, at least one computing unit adapted to determine, by a plurality of processes of the parallel program, a plurality of version numbers representative of a plurality of current valid checkpoint files corresponding to the plurality of processes. At least one computing unit is adapted to select from the plurality of version numbers a selected version number representative of a consistent set of checkpoint files for the parallel program.
In a further aspect of the present invention, a system of capturing a set of checkpoint files for a parallel program is provided. The system includes, for instance, means for providing a plurality of checkpoint files to be used during a taking of a plurality of checkpoints corresponding to a plurality of processes of the parallel program, wherein each of the plurality of checkpoint files has a name, which includes a version number; and means for taking the plurality of checkpoints using the plurality of checkpoint files.
In another aspect of the present invention, an article of manufacture, including at least one computer usable medium having computer reada
Agbaria Adnan M.
Meth Kalman Zvi
Gonzalez, Esq. Floyd A.
Heslin Rothenberg Farley & Mesiti P.C.
International Business Machines Corporation
Ray Gopal C.
Schiller, Esq. Blanche E.