Electrical computers and digital processing systems: multicomput – Computer-to-computer data routing – Least weight routing
Reexamination Certificate
1996-03-06
2001-03-27
Banankhah, Majid (Department: 2151)
C710S260000
active
06209019
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
This invention relates to a data processing system, network, and data processing method which increase reliability by executing processes using programs in a plurality of versions.
2. Description of the Prior Art
In systems, such as industrial systems, traffic control systems, and power plant systems such as a nuclear power plant, where ever-changing data is processed and the system is controlled based on the processing result, the safety of the system must be maintained under any condition.
This means that reliability is vital to data processing system devices such as computers or computer networks which are used in those systems. In particular, system errors have significant effects on those devices. System errors are caused by hardware errors or program bugs. Recently, as hardware reliability increases, program reliability has become more important. However, as programs become large and complicated, it is virtually impossible to create error-free programs.
To address this problem, software techniques have been proposed which make a program appear free of errors even when it contains errors.
One widely accepted technique is the so-called multiversioning method. This method puts a computer in multiversioning mode so that the programs in the computer run in that mode, which enables the system to continue normal operation even if a system error occurs. However, running a program in multiversioning mode requires that a plurality of copies of the program be created. Consequently, if the program has one or more bugs, all of the copies stop due to the same bug, causing the computer or a part of the system functions to stop. To solve this problem, the methods given below have been proposed:
(1) N versions program method
In this method, a plurality of designers create programs which perform the same function using different procedures. Thus, a plurality of programs, each with its own version, are created to perform the same function. This “N versions program method” allows a plurality of programs to be run in the computer concurrently. These programs, driven by the program called a driver which behaves just like an operating system (OS), are synchronized by the driver each time they reach pre-defined checkpoints. When the majority of programs produce the same result, that result is selected as a correct output.
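The voting step at a checkpoint can be illustrated with a minimal Python sketch. It is not taken from the patent: the function names (`majority_vote`, `run_n_versions`) and the three toy versions are hypothetical, and the versions are run one after another for simplicity, whereas the method described above runs them concurrently under a driver.

```python
from collections import Counter
from typing import Callable, Hashable, Optional, Sequence


def majority_vote(results: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value produced by a strict majority of versions, else None."""
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) / 2 else None


def run_n_versions(versions: Sequence[Callable[[int], int]], x: int) -> Optional[int]:
    """Hypothetical driver: run every version up to the checkpoint, then vote."""
    results = [version(x) for version in versions]
    return majority_vote(results)


# Three independently written (hypothetical) versions of the same function.
def square_v1(x: int) -> int:
    return x * x


def square_v2(x: int) -> int:
    return x ** 2


def square_v3_buggy(x: int) -> int:
    return x + x  # bug: only correct for x == 0 and x == 2


if __name__ == "__main__":
    print(run_n_versions([square_v1, square_v2, square_v3_buggy], 3))
```

With the input 3, two of the three versions agree, so the faulty result of the third version is outvoted. When no strict majority exists, this sketch returns `None`; how a real driver handles that case is not stated in the text above.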
(2) Recovery block method
This method is described below using program B and its alternate programs B′ and B″.
In this recovery block method, checkpoints, at which a predetermined amount of processing ends, are provided for program B and alternate programs B′ and B″, and a test (the acceptance test) is performed to check whether the execution result of processing matches the desired value. First, program B is run, and the acceptance test is executed at a checkpoint to check whether the execution result is acceptable. If the execution result of program B is acceptable, processing continues; otherwise, alternate program B′ is started.
When the execution result is rejected, alternate program B′ is started to perform alternate processing. At this time, the internal status at the preceding successful checkpoint, that is, the checkpoint data accepted by the acceptance test at the preceding checkpoint, is passed to alternate program B′ for use in alternate processing. The result of this alternate processing is then checked by the acceptance test and, if it is rejected, alternate program B″ is started. This processing is repeated until the execution result is accepted by the acceptance test or until there are no more alternate programs. Therefore, if the execution result of alternate program B″ is also rejected, program B is determined to be unreliable.
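The control flow of the recovery block method can be sketched in Python as follows. This is an illustrative sketch only: `recovery_block`, the toy programs, and the acceptance test are hypothetical names, and treating an exception as a rejected result is an assumption that the text above does not address.

```python
import copy
from typing import Any, Callable, Sequence


def recovery_block(
    programs: Sequence[Callable[[dict], Any]],
    acceptance_test: Callable[[Any], bool],
    checkpoint_data: dict,
) -> Any:
    """Try B, then B', then B'' ... until one result passes the acceptance test."""
    for program in programs:
        # Each attempt starts from the data accepted at the preceding checkpoint.
        state = copy.deepcopy(checkpoint_data)
        try:
            result = program(state)
        except Exception:
            continue  # assumption: a crash counts as a rejected result
        if acceptance_test(result):
            return result
    raise RuntimeError("all alternates rejected; program considered unreliable")


# Hypothetical primary program B (buggy) and alternate B' (correct).
def program_b(state: dict) -> float:
    state["x"] = -1.0  # corrupts its private copy of the checkpoint data
    return -1.0


def program_b_prime(state: dict) -> float:
    return state["x"] ** 0.5


def accept_sqrt_of_two(result: float) -> bool:
    return result >= 0 and abs(result * result - 2.0) < 1e-9


if __name__ == "__main__":
    print(recovery_block([program_b, program_b_prime], accept_sqrt_of_two, {"x": 2.0}))
```

Because every attempt receives a fresh copy of the checkpoint data, the corruption caused by the failed run of B does not reach B′.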
(3) Self-checking method
In the recovery block method described above, alternate programs B′ and B″ are started only after program B fails the acceptance test. In the self-checking method, by contrast, alternate programs B′ and B″ are run concurrently with program B. Note that, in the self-checking method, alternate program B′ takes over the processing of program B and outputs data to external programs only after the acceptance test of program B fails.
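A rough Python sketch of the self-checking method is given below. The names (`self_checking`, the toy programs, the acceptance test) are hypothetical, threads stand in for whatever concurrency mechanism an actual system would use, and treating an exception as a failed acceptance test is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Sequence


def self_checking(
    programs: Sequence[Callable[[float], Any]],
    acceptance_test: Callable[[Any], bool],
    data: float,
) -> Any:
    """Run B, B', B'' concurrently; output the first acceptable result in priority order."""
    with ThreadPoolExecutor(max_workers=len(programs)) as pool:
        # All versions start at the same time, unlike the recovery block method.
        futures = [pool.submit(program, data) for program in programs]
        for future in futures:  # priority order: B first, then B', then B''
            try:
                result = future.result()
            except Exception:
                continue  # assumption: a crash counts as a failed acceptance test
            if acceptance_test(result):
                return result
    raise RuntimeError("no program passed the acceptance test")


# Hypothetical primary program B (buggy) and alternate B' (correct).
def program_b(x: float) -> float:
    return -x


def program_b_prime(x: float) -> float:
    return abs(x)


def accept_non_negative(result: float) -> bool:
    return result >= 0


if __name__ == "__main__":
    print(self_checking([program_b, program_b_prime], accept_non_negative, 3.0))
```

The result of B′ is used only when the result of B is rejected, but B′ has already been running, so no restart from checkpoint data is needed.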
3. Problems to Be Solved by the Invention
The methods described above have the following problems. In the “N versions program method”, when a plurality of programs in different versions are run concurrently, the system must wait, at each checkpoint, for the slowest program to end. Therefore, during daily operation, the overall system performance is determined by the processing performance of the slowest program.
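The cost of synchronizing at every checkpoint can be made concrete with a small calculation. The timings below are invented for illustration and do not come from the patent; `synchronized_total` is a hypothetical helper that sums, for each checkpoint interval, the time of the slowest version.

```python
from typing import Dict, List


def synchronized_total(times_per_version: Dict[str, List[float]]) -> float:
    """Total time when every version must wait for the slowest at each checkpoint."""
    # zip(*...) groups the per-version times by checkpoint interval.
    return sum(max(interval) for interval in zip(*times_per_version.values()))


# Hypothetical seconds spent by each version between successive checkpoints.
timings = {
    "B": [1.0, 2.0],
    "B'": [1.5, 1.0],
    "B''": [3.0, 0.5],
}

if __name__ == "__main__":
    print(synchronized_total(timings))  # time with checkpoint synchronization
    print(sum(timings["B"]))            # time if B ran alone
```

In this example the synchronized run takes 5.0 seconds, although program B alone would finish in 3.0 seconds, because each interval is paced by whichever version is slowest in that interval.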
In the “recovery block method” or “self-checking method”, an alternate program takes over processing only after the program fails the acceptance test. This take-over processing requires time and delays program processing. In addition, since an alternate program usually places emphasis on having fewer bugs rather than on performance, program B′ is slower than program B during concurrent operation. This negates the advantage of concurrent operation. An attempt to run alternate programs B′ and B″ concurrently with, and as fast as, program B will result in the disadvantage associated with the “N versions program method”.
Conventional high-reliability methods for programs are intended to increase software reliability rather than to detect and recover from hardware failures. There is a method in which the same program is run concurrently on other computers so that the program keeps running even when an error occurs in one of the computers. However, if it is difficult to determine whether the error is a software error or a hardware error, the conventional software high-reliability method does not solve the problem; that is, when a hardware error occurs in a system where this method is employed, control is passed to a poorer-performance alternate program and, as a result, performance is degraded.
Even if it is possible to determine whether a system error is a hardware error or a software error, the program for that determination must always be active. In addition, there is a possibility that a hardware error and a software error occur at the same time. This makes the determination and the subsequent take-over processing more difficult.
SUMMARY OF THE INVENTION
This invention is intended to solve the problems associated with the conventional techniques. It is an object of this invention to provide a data processing system, computer network, and data processing method which can pass processing to an alternate program without being affected by the poorest-performing version and without wasting time in passing processing to an alternate program. It is also an object of this invention to provide a data processing system, computer network, and data processing method which are capable of keeping a program running not only when a software error occurs, but also when a hardware error occurs.
In accordance with one aspect of the present invention, a data processing system allows a plurality of programs, each designed according to its own version to run concurrently, the data processing system executing processes each corresponding to one of the plurality of programs, the data processing system comprising executing means, provided for each of the plurality of programs, for executing a first process corresponding to a first of the programs; detecting means for detecting an execution point of the first process executed by the executing means; notifying means for issuing an interrupt instruction to programs other than the first program when an execution point of the first process detected by the detecting means has reached a pre-defined execution checkpoint; sending/receiving means for sending the interrupt instruction issued by the notifying means to all programs other than the first program and for receiving an interrupt instruction from one of the other programs; and interrupt controlling means for controlling the executing means to interrupt the pr
Hasegawa Tetsuo
Kaibe Hiroshi
Okataku Yasukuni
Seki Toshibumi
Tamura Shinsuke
Banankhah Majid
Finnegan Henderson Farabow Garrett & Dunner L.L.P.
Kabushiki Kaisha Toshiba
Lao Sue