Reexamination Certificate
2006-03-20
2009-08-04
Chu, Gabriel L (Department: 2114)
Error detection/correction and fault detection/recovery
Data processing system error or fault handling
Reliability and availability
C714S013000
active
07571347
ABSTRACT:
A system that provides fault tolerance in a parallel processing system. During operation, the system executes a parallel computing application in parallel across a subset of computing nodes within the parallel processing system. During this process, the system monitors telemetry signals within the parallel processing system. The system analyzes the monitored telemetry signals to determine if the probability that the parallel processing system will fail is increasing. If so, the system increases the frequency at which the parallel computing application is checkpointed, wherein a checkpoint includes the state of the parallel computing application at each computing node within the parallel processing system.
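The control loop described in the abstract can be illustrated with a minimal sketch. It is not the patented method: the abstract does not say how telemetry is turned into a failure probability or how much the checkpoint frequency is raised. The sketch assumes each telemetry sample has already been reduced to a risk estimate between 0 and 1, detects a rising trend by comparing the mean of the newest window of estimates against the mean of the preceding window, and halves the checkpoint interval whenever the trend is rising. The class name `AdaptiveCheckpointScheduler`, its parameters, and the halving policy are all hypothetical.

```python
from collections import deque
from statistics import mean


class AdaptiveCheckpointScheduler:
    """Hypothetical sketch: shorten the checkpoint interval while risk is rising."""

    def __init__(self, base_interval_s=3600.0, min_interval_s=60.0, window=4):
        self.interval_s = base_interval_s      # current time between checkpoints
        self.min_interval_s = min_interval_s   # never checkpoint more often than this
        self.window = window                   # samples per comparison window
        # Holds two adjacent windows of risk estimates: older half, newer half.
        self.history = deque(maxlen=2 * window)

    def risk_is_increasing(self):
        """Return True if the newer window's mean risk exceeds the older window's."""
        if len(self.history) < 2 * self.window:
            return False  # not enough samples to compare two full windows
        samples = list(self.history)
        older = mean(samples[:self.window])
        newer = mean(samples[self.window:])
        return newer > older

    def observe(self, risk_estimate):
        """Record one risk estimate and return the checkpoint interval to use next."""
        self.history.append(risk_estimate)
        if self.risk_is_increasing():
            # Assumed policy: halve the interval (double the frequency),
            # bounded below by min_interval_s.
            self.interval_s = max(self.min_interval_s, self.interval_s / 2)
        return self.interval_s


if __name__ == "__main__":
    scheduler = AdaptiveCheckpointScheduler(window=2)
    for risk in [0.01, 0.01, 0.01, 0.01, 0.05, 0.20]:
        print(risk, scheduler.observe(risk))
```

In a real deployment the returned interval would drive a coordinated checkpoint that captures application state on every computing node, as the abstract requires. A production design would also need a noise-tolerant trend test and a rule for lengthening the interval again once risk subsides, neither of which is shown here.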
REFERENCES:
patent: 5664090 (1997-09-01), Seki et al.
patent: 5712971 (1998-01-01), Stanfill et al.
patent: 2005/0114739 (2005-05-01), Gupta et al.
patent: 2006/0168473 (2006-07-01), Sahoo et al.
patent: 2007/0168715 (2007-07-01), Herz et al.
Plank, James S., “ickp: A Consistent Checkpointer for Multicomputers”, 1994, IEEE.
Cao et al., “Design and Analysis of An Efficient Algorithm for Coordinated Checkpointing in Distributed Systems”, 1997, IEEE.
Li et al., “Low-Latency, Concurrent Checkpointing for Parallel Programs”, 1994, IEEE.
Kim et al., “An Efficient Protocol for Checkpointing Recovery in Distributed Systems”, 1993, IEEE.
Gross Kenny C.
Wood Alan P.
Chu Gabriel L
Park Vaughan & Fleming LLP
Sun Microsystems Inc.