Template based parallel checkpointing in a massively...

Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability

Reexamination Certificate


Details

C714S006130, C714S020000


active

07487393

ABSTRACT:
A method and apparatus provide a template-based parallel checkpoint save for a massively parallel supercomputer system, using a parallel variation of the rsync protocol and network broadcast. In preferred embodiments, the checkpoint data for each node is compared to a previously produced template checkpoint file that resides in storage. Embodiments herein greatly decrease the amount of data that must be transmitted and stored, which speeds up checkpointing and increases the efficiency of the computer system. Embodiments are directed to a parallel computer system with nodes arranged in a cluster with a high-speed interconnect that can perform broadcast communication. The checkpoint contains a set of actual small data blocks, with their corresponding checksums, from all nodes in the system. The data blocks may be compressed using conventional lossless data compression algorithms to further reduce the overall checkpoint size.
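
The comparison against a template described in the abstract can be illustrated with a minimal single-node sketch. It is not the patented method: it assumes fixed-offset blocks rather than rsync's rolling checksum, uses SHA-256 as the block checksum, uses zlib as the lossless compressor, and does not model the network broadcast. The block size and the names `block_checksums`, `make_delta`, and `restore` are hypothetical.

```python
import hashlib
import zlib

BLOCK_SIZE = 4096  # assumed block size; the abstract does not specify one


def block_checksums(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Return one SHA-256 digest per fixed-size block of data."""
    return [
        hashlib.sha256(data[offset:offset + block_size]).digest()
        for offset in range(0, len(data), block_size)
    ]


def make_delta(
    node_data: bytes,
    template_sums: list[bytes],
    block_size: int = BLOCK_SIZE,
) -> dict[int, tuple[bytes, bytes]]:
    """Keep only the blocks whose checksum differs from the template's.

    Maps block index -> (checksum, zlib-compressed block).
    """
    delta = {}
    for index, offset in enumerate(range(0, len(node_data), block_size)):
        block = node_data[offset:offset + block_size]
        digest = hashlib.sha256(block).digest()
        # A block past the end of the template always counts as changed.
        if index >= len(template_sums) or digest != template_sums[index]:
            delta[index] = (digest, zlib.compress(block))
    return delta


def restore(
    template: bytes,
    delta: dict[int, tuple[bytes, bytes]],
    length: int,
    block_size: int = BLOCK_SIZE,
) -> bytes:
    """Rebuild a node's checkpoint from the template plus its stored blocks."""
    out = bytearray()
    for index, offset in enumerate(range(0, length, block_size)):
        if index in delta:
            digest, packed = delta[index]
            block = zlib.decompress(packed)
            if hashlib.sha256(block).digest() != digest:
                raise ValueError(f"checksum mismatch in block {index}")
        else:
            block = template[offset:offset + block_size]
        out += block
    return bytes(out[:length])


if __name__ == "__main__":
    template = bytes(range(256)) * 64          # 16 KiB template, 4 blocks
    node = bytearray(template)
    node[5000] ^= 0xFF                         # change one byte in block 1
    delta = make_delta(bytes(node), block_checksums(template))
    print(sorted(delta))                       # indices of stored blocks
    assert restore(template, delta, len(node)) == bytes(node)
```

In this sketch only the blocks that differ from the template are stored, so a node whose state is close to the template contributes few blocks to the checkpoint.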

