Method and apparatus for implementing fault-tolerant...


Filing date: 2000-06-30
Issue date: 2004-02-03
Examiner: Iqbal, Nadeem (Department: 2184)
Classification: Error detection/correction and fault detection/recovery; Data processing system error or fault handling; Reliability and availability
US classification: C714S038110
Document type: Reexamination Certificate
Status: active
Patent number: 06687849

ABSTRACT:

The present invention relates to fault-tolerant computer systems. In particular, it relates to reducing the memory required to operate a fault-tolerant system.
Processes run on computers are used to obtain many useful results. For instance, computer processes can be used for word processing, for performing calculations, for banking purposes, and for routing messages in a network. A problem with computer processes is that sometimes a process will fail. Although for some programs failure may have minimal negative consequences, in other cases, such as a banking application, the negative consequences can be catastrophic.
It is known to use fault-tolerant processes to enable recovery from failure of a process. In particular, it is known to use traditional process pairs, where one process is the working process doing work and the other process is essentially a clone of the working process that takes over if the working process fails. See, e.g., “Transaction monitor process with pre-arranged modules for a multiprocessor system”, U.S. Pat. No. 5,576,945, issued Nov. 19, 1996. At intervals, the working process sends information about its state (“checkpoints”) to the backup process. (In process pairs, a checkpoint is sent at a minimum when an external state relating to the process is changed, such as when a file is opened or when a banking program does a funds transfer. Usually, however, checkpoints are sent much more frequently.) Upon failure, the backup process begins execution from the last checkpointed state.
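The traditional process-pair arrangement described above can be illustrated with a minimal Python sketch. It is a toy model, not the mechanism of the cited patent: the class names, the `transfer` and `scratch_work` methods, and the dictionary used as process state are all assumptions made for illustration. A checkpoint is sent whenever externally visible state changes (here, a funds transfer), and takeover resumes from the last checkpointed state.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class BackupProcess:
    """Hypothetical clone that holds the last checkpointed state."""
    state: dict = field(default_factory=dict)

    def receive_checkpoint(self, state: dict) -> None:
        # Keep a private copy so later changes in the primary do not leak in.
        self.state = copy.deepcopy(state)

    def take_over(self) -> "WorkingProcess":
        # Resume as the new working process from the last checkpoint.
        return WorkingProcess(state=copy.deepcopy(self.state), backup=self)


@dataclass
class WorkingProcess:
    """Hypothetical working process that checkpoints to its backup."""
    state: dict
    backup: BackupProcess

    def transfer(self, src: str, dst: str, amount: int) -> None:
        # External state changes, so a checkpoint is sent immediately.
        self.state[src] -= amount
        self.state[dst] += amount
        self.backup.receive_checkpoint(self.state)

    def scratch_work(self, key: str, value: int) -> None:
        # Internal work that is not checkpointed in this sketch.
        self.state[key] = value


backup = BackupProcess()
primary = WorkingProcess(state={"alice": 100, "bob": 0}, backup=backup)
backup.receive_checkpoint(primary.state)   # initial checkpoint

primary.transfer("alice", "bob", 30)       # checkpointed
primary.scratch_work("tmp", 1)             # lost if the primary fails now

replacement = backup.take_over()           # simulate failure of the primary
```

After the simulated failure, `replacement.state` holds the balances as of the transfer, and the uncheckpointed `tmp` entry is gone, which is the behaviour the paragraph describes for execution resuming from the last checkpointed state.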
A problem with traditional process pairs is that, because a redundant process is set up, about double the memory needed to run a single process is required. The clone holds a copy of the memory image of the working process, including the state of the working space memory such as the stack. A copy of the program (“code segment”) is also maintained in memory. The code segment typically is an object file read from disk, loaded into memory at run time, and executed. The code segment typically accounts for a relatively large portion of the memory image copy.
Memory is expensive and also takes up space. Accordingly, it would be advantageous to have a way to run fault-tolerant processes using less memory. It would further be advantageous for the time needed to take over for a failed process to be short.
SUMMARY OF THE INVENTION
Systems and methods for implementing a memory-efficient fault-tolerant computing system are provided by virtue of one embodiment of the present invention. A generic backup process may provide fault tolerance to multiple working processes. The backup process need not include a copy of the code segments executed by the working processes, providing very large savings in the memory needed to implement the fault-tolerant system. Alternatively, multiple backup processes provide fault tolerance but need not include duplicated code segments for the working processes they support.
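The memory saving can be made concrete with a small worked calculation. The sketch below assumes, for illustration only, that each process image splits cleanly into a code segment and state (stack, heap), and that a code-free backup holds only the state. The sizes are invented numbers, not figures from the patent.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ProcessImage:
    """Hypothetical memory footprint of one working process, in KB."""
    code_kb: int    # code segment loaded from the object file
    state_kb: int   # stack, heap, and other working state


def traditional_pairs_kb(images: List[ProcessImage]) -> int:
    # Each working process has a full clone: code and state are both doubled.
    return sum(2 * (p.code_kb + p.state_kb) for p in images)


def code_free_backup_kb(images: List[ProcessImage]) -> int:
    # Working processes keep code and state; the backup keeps state only.
    working = sum(p.code_kb + p.state_kb for p in images)
    backup = sum(p.state_kb for p in images)
    return working + backup


# Invented example sizes.
images = [ProcessImage(code_kb=800, state_kb=200),
          ProcessImage(code_kb=1500, state_kb=500)]

traditional = traditional_pairs_kb(images)   # 2*1000 + 2*2000 = 6000
code_free = code_free_backup_kb(images)      # 3000 + 700 = 3700
saved = traditional - code_free              # 2300, the total code size
```

Under these assumptions the saving equals the combined size of the code segments, so the larger the code segment is relative to the state, the closer the total approaches the memory of running the working processes alone.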
In one embodiment, backup processes maintain state information about each working process including the contents of stack memory and heap memory. Checkpoint messages from a working process to a backup process keep the state information updated to facilitate takeover on failure. At takeover on failure, a backup loads a code segment associated with the working process and resumes using the current backup state information. With recent advances in processor speed, loading of the code segment occurs very quickly.
In one embodiment, a method for recovery of an original working process upon failure is provided. State information associated with the original working process is obtained. A copy of a code segment associated with the original working process is obtained and loaded into memory. The code segment is caused to execute as an active working process, using the state information.
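The three recovery steps (obtain the state, obtain and load the code segment, execute it using the state) can be sketched as follows. This is a simplified model under stated assumptions, not the patented implementation: `CODE_ON_DISK` stands in for object files on disk, the `resume(state)` entry point is a hypothetical convention, and Python's built-in `compile` and `exec` stand in for loading a code segment into memory.

```python
from typing import Any, Dict

# Stand-in for object files on disk: process name -> source text (hypothetical).
CODE_ON_DISK: Dict[str, str] = {
    "counter": (
        "def resume(state):\n"
        "    state['count'] += 1\n"
        "    return state\n"
    ),
}


class GenericBackup:
    """Hypothetical backup that stores state for many working processes
    but holds none of their code until a takeover is needed."""

    def __init__(self) -> None:
        self.states: Dict[str, Dict[str, Any]] = {}

    def receive_checkpoint(self, name: str, state: Dict[str, Any]) -> None:
        # Keep the latest state for this working process (shallow copy).
        self.states[name] = dict(state)

    def take_over(self, name: str) -> Dict[str, Any]:
        # Step 1: obtain the state information for the failed process.
        state = dict(self.states[name])
        # Step 2: obtain the code segment and load it into memory.
        source = CODE_ON_DISK[name]
        namespace: Dict[str, Any] = {}
        exec(compile(source, f"<{name}>", "exec"), namespace)
        # Step 3: execute the loaded code using the checkpointed state.
        return namespace["resume"](state)


backup = GenericBackup()
backup.receive_checkpoint("counter", {"count": 41})
recovered = backup.take_over("counter")
```

The design point is that `GenericBackup` keeps only per-process state between checkpoints; the code is fetched and loaded only at takeover, which trades a short load delay for the memory that a resident duplicate code segment would occupy.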
A further understanding of the nature and advantages of the inventions herein may be realized by reference to the remaining portions of the specification and the attached drawings.

REFERENCES:
patent: 4819154 (1989-04-01), Stiffler et al.
patent: 4819159 (1989-04-01), Shipley et al.
patent: 4907232 (1990-03-01), Harper et al.
patent: 4959774 (1990-09-01), Davis
patent: 5235700 (1993-08-01), Alaiwan et al.
patent: 5287492 (1994-02-01), Reynders
patent: 5473771 (1995-12-01), Burd et al.
patent: 5533188 (1996-07-01), Palumbo
patent: 5576945 (1996-11-01), McCline et al.
patent: 5590274 (1996-12-01), Skarpelos et al.
patent: 5748882 (1998-05-01), Huang
patent: 5978933 (1999-11-01), Wyld et al.
patent: 6023772 (2000-02-01), Fleming
patent: 6035415 (2000-03-01), Fleming
patent: 6044475 (2000-03-01), Chung et al.
patent: 6141773 (2000-10-01), St. Pierre et al.
patent: 6256751 (2001-07-01), Meth et al.
patent: 6332199 (2001-12-01), Meth et al.
patent: 6338126 (2002-01-01), Ohran et al.

Inventor: Cherf Gerald Scott
Corporate Assignee: Cisco Technology Inc.
Examiner: Iqbal Nadeem
