Fault tolerant computer system

Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability

Reexamination Certificate

Details

C714S011000, C714S037000

active

06438707

ABSTRACT:

FIELD OF THE INVENTION
The present invention relates to a fault tolerant computer system and to a method of fault tolerant operation of a computer system.
BACKGROUND OF THE INVENTION
Computers or computer systems are increasingly employed for fault sensitive applications, such as banking systems or telecommunications networks. Severe problems may arise if the computer fails, or even in case of a single faulty operation. For example, in a banking system an amount of money may erroneously be transferred between accounts, in a telecommunications system communication lines may be interrupted without notice, undesired connections may be established or the system may come to a complete halt for a prolonged period of time. Obviously, it is desirable to avoid such problems.
A generally known method to cope with the above problem is to replicate a computer system on a one-to-one basis, and to make both computer systems execute the same sequence of instructions. However, this will require a high inter-unit communication load between the two computer systems, since operations need to be checked and synchronized on a very detailed level. Further, computers increasingly operate at higher frequencies where the handling of the inter-unit communications becomes an important cost factor.
An approach to reducing the inter-unit communication load is described in U.S. Pat. No. 5,544,304. Commands are received and queued by both an active and a stand-by unit. Only the active unit processes the commands. The system provides short messages which are transmitted between the active and stand-by units, inquiring about or providing the status of particular commands. A periodic handshake is executed between the two units, involving short signals which are exchanged between controllers of the active and stand-by units.
However, in case of a failure, this system requires a long time to restart operations using the stand-by units, since with only periodic handshaking performed between the units, a high level of synchronization cannot be maintained.
SUMMARY OF THE INVENTION
It is therefore an object of the invention to provide a fault tolerant computer system and a method of operating a fault tolerant computer system requiring a low communication load between a primary system and a backup system while allowing a high level of synchronization.
This object of the invention is solved by a fault tolerant computer system, comprising: a primary system connected to external devices, including: a primary central processing unit for executing event processes, an event process being a process executed upon the occurrence of a command at the primary system; primary memory means connected to the primary central processing unit for storing system data and application data; an event generator connected to the primary central processing unit for generating an event message each time the primary central processing unit halts the execution of an event process, the event message at least including information about the type of event process and the reason for halting the execution of the event process; at least one backup system connected to the primary system, including: a backup central processing unit for executing event processes; backup memory means connected to the backup central processing unit for storing system and application data; a buffer for receiving and intermediately storing a sequence of event messages from the primary system; and backup control means connected to the backup central processing unit, for scheduling the execution of event processes in accordance with the event messages.
The object of the invention is further solved by a method of fault tolerant operation of a computer system, including a primary system and at least one backup system, including the steps of: at the primary system: executing event processes by a primary central processing unit, an event process being a process executed upon the occurrence of a command at the primary system; generating an event message each time the primary central processing unit halts the execution of an event process, the event message at least including information about the type of the event process and the reason for halting execution of the event process; transmitting each event message to at least one backup system; at the at least one backup system: recording and intermediately storing the event messages from the primary system in a buffer; scheduling the execution of event processes of corresponding event messages at the buffer; and executing the event processes by the backup central processing unit in accordance with the event messages.
According to the invention, a primary system comprises a primary central processing unit, primary memory means for storing system data and application data and an event generator for generating an event message each time the primary central processing unit halts the execution of an event process. The event message at least includes information about the type of event process and the reason for halting the execution of the event process. At least one backup system is provided, comprising a backup central processing unit, backup memory means and a buffer for receiving and intermediately storing a sequence of event messages received from the primary system. Backup control means schedule the execution of event processes corresponding to respective event messages. The event processes are executed at the primary system and at the backup system in the same manner.
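The arrangement described above can be illustrated with a minimal sketch. It is not the patented implementation: the class names `EventMessage`, `PrimarySystem` and `BackupSystem`, the string-valued fields, and the example process names are all assumptions made for illustration. The sketch only shows the message flow, with one event message per halt, buffered at the backup and consumed in arrival order.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class EventMessage:
    # The two fields the text requires: type of event process and halt reason.
    process_type: str
    halt_reason: str


class BackupSystem:
    """Hypothetical backup: buffers event messages and consumes them in order."""

    def __init__(self):
        self.buffer = deque()  # intermediate storage for event messages
        self.executed = []     # record of what the backup has processed

    def receive(self, message):
        self.buffer.append(message)

    def run_pending(self):
        # Backup control: schedule processes in the order the messages arrived.
        while self.buffer:
            message = self.buffer.popleft()
            self.executed.append((message.process_type, message.halt_reason))


class PrimarySystem:
    """Hypothetical primary: reports to the backup only when a process halts."""

    def __init__(self, backup):
        self.backup = backup
        self.executed = []

    def halt(self, process_type, halt_reason):
        # Event generator: one message per halt, nothing in between.
        self.executed.append((process_type, halt_reason))
        self.backup.receive(EventMessage(process_type, halt_reason))


backup = BackupSystem()
primary = PrimarySystem(backup)
primary.halt("transfer", "interrupt")
primary.halt("audit", "termination")
primary.halt("transfer", "termination")
backup.run_pending()
```

After `run_pending()` the backup has seen the same sequence of halts as the primary, and its buffer is empty. A real system would execute the actual event process at the point where this sketch merely records the message.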
Advantageously, the primary processing unit reports an event message to the backup system only when the execution of an event process is halted. This allows a significant reduction of inter-unit communications, since a detailed check of the status of the at least one backup system by the primary system is no longer required.
Since at the at least one backup system all necessary information about the event process and the reason for halting the execution of the event process is known via the event messages, the at least one backup system is able to replicate the course of execution of the event processes at the primary system. This includes data accessed, generated or otherwise affected, and includes halting an event process at exactly the same location or point in time, i.e., after the same number of instructions, as before at the primary system.
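Halting "after the same number of instructions" can be sketched as follows. This is a toy model under stated assumptions: the `Process` class, its accumulator "instruction", and the idea that the halt point is conveyed as an explicit instruction count are illustrative choices, not details taken from the patent text.

```python
from dataclasses import dataclass


@dataclass
class Process:
    """Toy event process: each 'instruction' adds its index to an accumulator."""
    name: str
    executed: int = 0     # instructions executed so far
    accumulator: int = 0  # stands in for data affected by the process

    def step(self):
        self.accumulator += self.executed
        self.executed += 1

    def run(self, instructions):
        for _ in range(instructions):
            self.step()


def replay(process, halt_after):
    """Run the backup copy until it has executed exactly `halt_after` instructions."""
    process.run(halt_after - process.executed)
    return process


# Primary: runs 5 instructions, is halted, and reports the count (assumed field).
primary_copy = Process("transfer")
primary_copy.run(5)
reported_count = primary_copy.executed

# Backup: replays to the same point using only the reported count.
backup_copy = replay(Process("transfer"), reported_count)
```

Because the toy process is deterministic, the backup copy ends with the same instruction count and the same accumulator value as the primary copy, without any memory contents being transmitted.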
With an exactly identical execution of event processes at the primary system and at the at least one backup system, a high level of synchronization between the states of the primary system and the at least one backup system, including memory contents, may be achieved. It is no longer necessary to check, e.g., the memory means on a detailed level or to report changes to the memory means, as was previously required. The at least one backup system will apply exactly the same changes to the database or system data as were applied at the primary system.
In an advantageous embodiment of the invention, two possible reasons for halting an event process are considered. Firstly, an event process can terminate normally, i.e., when the execution of the corresponding command has been completed. Secondly, an event process may be interrupted, e.g., by a further command that requests the execution of another event process and has a higher priority level. Thus, the event message will include information indicating whether the event process was halted due to a normal termination or due to an interrupt.
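The two halt reasons can be modeled as an enumeration that a backup uses to decide whether a process is finished or merely suspended. The following is a sketch only: `HaltReason`, `apply_halt`, and the process names "billing" and "alarm" are hypothetical, and tracking suspended processes in a set is one possible bookkeeping choice, not something the text prescribes.

```python
from enum import Enum


class HaltReason(Enum):
    TERMINATION = "termination"  # the command was completed
    INTERRUPT = "interrupt"      # a higher-priority command preempted the process


def apply_halt(suspended, process_type, reason):
    """Update the set of suspended processes for one reported halt."""
    if reason is HaltReason.INTERRUPT:
        # The process is unfinished; remember it so it can be resumed later.
        suspended.add(process_type)
    else:
        # Normal termination: the process no longer needs to be resumed.
        suspended.discard(process_type)
    return suspended


suspended = set()
apply_halt(suspended, "billing", HaltReason.INTERRUPT)    # preempted by "alarm"
apply_halt(suspended, "alarm", HaltReason.TERMINATION)    # "alarm" completes
still_suspended = set(suspended)
apply_halt(suspended, "billing", HaltReason.TERMINATION)  # "billing" resumes and completes
```

After the second message only "billing" is still awaiting resumption; after the third, nothing is.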
In a further advantageous embodiment of the invention, means are provided for generating event data indicative of the execution of an event process both at the primary system and at the at least one backup system. Further, means are provided for detecting a system fault based on a comparison of the event data generated at the primary system and at the at least one backup system. Thus, it can be determined whether the operation of the computer system is fault free. If it is detected that a fault occurred at the primary system, a backup system may be selected to take over as the new primary system. A fault may include a software faul
