Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability
Reexamination Certificate
1998-11-03
2001-10-30
Ray, Gopal C. (Department: 2181)
C714S010000, C709S200000, C700S002000
active
06311289
ABSTRACT:
BACKGROUND
This invention relates generally to fault tolerant computer systems and more particularly to redundant processor computer systems used in support of telephone systems.
Modern telephone systems handle large volumes of time critical information on a routine basis. In such systems fault tolerance is a high priority and the need exists for a redundant processor system. APZ is an example of a computer system in which both processors execute in lockstep. Tandem Integrity is an example of a triple redundancy processor system. In fault tolerant systems there are at least two Central Processing Units (CPU's) that run in parallel, where one of the two CPU's is always in an Executive (EX) state and the other is in a Stand By (SB) state. Both CPU's run the same microcode and execute the same instructions. The difference between the EX CPU and the SB CPU, as the two processors will be referred to, is that only the output of the EX CPU is actually used by the system it supports. Of course, as is normal in fault tolerant systems, if the EX CPU should ever fail, or otherwise be taken out of operation, the output connections would be immediately switched to the SB CPU. In this manner the SB CPU could take over the processing chores of the system at any time, thus making the system fault tolerant. Examples of well known CPU's include the X86 family, Pentium and Pentium II CPU's manufactured by the Intel Corporation.
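The EX/SB arrangement described above can be illustrated with a minimal sketch. The class names, the `healthy` flag and the doubling "instruction" are hypothetical and chosen only for illustration; they are not taken from the patent.

```python
from dataclasses import dataclass


@dataclass
class Cpu:
    """Hypothetical stand-in for one CPU of the redundant pair."""
    name: str
    healthy: bool = True

    def execute(self, x: int) -> int:
        # Both CPUs run the same instructions, so both compute the same result.
        return x * 2


class RedundantPair:
    """Hypothetical EX/SB pair: only the EX CPU's output is used."""

    def __init__(self) -> None:
        self.ex = Cpu("A")  # Executive CPU
        self.sb = Cpu("B")  # Stand By CPU

    def step(self, x: int) -> int:
        if not self.ex.healthy:
            # Switch the output connection to the SB CPU.
            self.ex, self.sb = self.sb, self.ex
        ex_out = self.ex.execute(x)
        if self.sb.healthy:
            self.sb.execute(x)  # executed in parallel, output discarded
        return ex_out
```

In this sketch, marking the EX CPU as unhealthy causes the next call to `step` to take its output from the former SB CPU without any change visible to the caller.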
At this point a simple distinction should be drawn between a basic fault tolerant system and a basic multiprocessor system. In general, multiprocessor systems use more than one processor to work on different parts of the same job. Usually, in multiprocessor systems, there is one “manager” processor that divides up the job into smaller tasks and assigns the tasks to the other processors in the multiprocessor system. The managing processor may then begin a task itself or oversee the entire job, trying to optimize the system's performance by ensuring all of the processors in the system are processing an equal amount of work. Load sharing is a term often used to describe the type of work done by basic multiprocessor systems. In contrast, a basic fault tolerant system does not divide up the work load. Instead, each processor in a fault tolerant system does the entire job so that more than one processor is performing the same job. The same instructions and data are processed by each of the processors in a basic fault tolerant system. In this way, if one processor fails at any time another processor can take its place and take over the processing chore for the failed processor. A multiprocessor system would have faster results on a large problem than a fault tolerant system, but, if one of the processors in each of the above systems failed, the fault tolerant system would be the only one to complete the job without user intervention.
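The contrast between load sharing and redundant execution can be sketched as follows. The function names and the round-robin split are assumptions made for illustration only.

```python
from typing import List


def load_sharing(job: List[int], n_processors: int) -> List[List[int]]:
    """Multiprocessor style: each processor gets a different part of the job."""
    return [job[i::n_processors] for i in range(n_processors)]


def redundant(job: List[int], n_processors: int) -> List[List[int]]:
    """Fault tolerant style: every processor gets the entire job."""
    return [list(job) for _ in range(n_processors)]


def survives_failure(assignments: List[List[int]], failed: int,
                     job: List[int]) -> bool:
    """True if some remaining processor alone still holds the whole job."""
    remaining = [a for i, a in enumerate(assignments) if i != failed]
    return any(sorted(a) == sorted(job) for a in remaining)
```

With two processors, losing one under `load_sharing` leaves half of the job unassigned, whereas under `redundant` the surviving processor already holds all of it.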
There are many reasons why one of the processors in a fault tolerant system may be temporarily taken out of operation. Maintenance activities, such as repair of a faulty board or upgrading of the operating system, may force temporary “down time”. Detection and subsequent correction of a fault or error are examples of other circumstances that may cause a processor in a dual processor system to be temporarily taken “off line”. The terms CPU and processor are well known equivalents in the art and will be used interchangeably in this document. No matter what the reason, after either one of the processors has been off line, it will no longer be in synchronization with the processor which remained on line. Synchronization in this context refers to timing and also to having identical data in each processor. The areas of concern, with regard to the data in each processor, are the internal registers and main memory. Main memory, or just memory, refers to the random access memory or RAM associated with each CPU. Main memory may be divided into more than one portion, with each portion having defined addressing limits. Also, each CPU may have more than one “main” memory, in which case each memory would be given a different name to avoid confusion and addressing limits would not be a concern. The state of a CPU is defined by the contents of the internal registers, or hardware registers, of the CPU. It will be understood that, although the state of a CPU may include small memories such as caches and tables which may be used for branch prediction and linking purposes, the contents of register memory are generally accepted as defining the state of a CPU.
Prior to a restart, the processor which was taken off line, or faulty processor as it will be referred to, must be updated with the state of the processor which remained on line, or current processor. In other words, the contents of the current processor's internal registers must be loaded into the internal registers of the faulty processor. The memory of the faulty processor also needs to be loaded with the data in the memory of the current processor. This entire process is called updating or re-integration.
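As a minimal sketch of what updating involves, the model below treats a processor as a set of registers plus a main memory and copies both from the current processor to the faulty one. The `CpuState` type and the register names used with it are hypothetical; timing synchronization is not modeled.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class CpuState:
    """Hypothetical model: internal registers plus main memory."""
    registers: Dict[str, int] = field(default_factory=dict)
    memory: bytearray = field(default_factory=lambda: bytearray(16))


def reintegrate(current: CpuState, faulty: CpuState) -> None:
    """Load the faulty processor with the current processor's data."""
    # Copy main memory contents.
    faulty.memory[:] = current.memory
    # Copy register contents (a copy, so the two states stay independent).
    faulty.registers = dict(current.registers)


def in_sync(a: CpuState, b: CpuState) -> bool:
    """Data synchronization only: identical registers and memory."""
    return a.registers == b.registers and a.memory == b.memory
```

After `reintegrate(current, faulty)` returns, `in_sync(current, faulty)` holds in this model, which corresponds to the data half of the synchronization requirement described earlier.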
The challenge involved in re-integration is to complete the process in as little time as possible. Time is of the essence in the re-integration process because both CPU's must be involved in the re-integration process. Therefore, system application execution is temporarily stopped. As a result, overall system throughput is reduced. In dual-processor operations, degradation of system performance is directly proportional to the length of time required for re-integration. It is therefore important to provide a method by which a processor in a dual processor system may be updated in as little time as possible.
Two known methods of doing re-integration in dual processor computers can be referred to as “copy main memory” and “copy instruction execution results”. In copy main memory, which is illustrated in FIG. 1, the contents of main memory (EX) 12 are copied to the main memory (SB) 22 of the SB CPU 2. The state of the EX CPU 1, which is held in registers 11, is then copied to both main memory (EX) 12 and main memory (SB) 22. Synchronous restart is initiated reading the data formerly held in registers 11 into both CPU's in parallel. This method is used, for example, in the IMP and Tandem Integrity fault tolerant systems. The drawback with this method is that it is slow because main memory, which may be an order of magnitude slower than registers, is intimately involved. Further, transfer of the state of the EX CPU 1 requires two main memory operations, a write and a read, since the contents of the internal registers must first be transferred to memory before they can be transferred to the SB CPU 2. The result is a long stop of application execution, which as mentioned above, degrades system performance.
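The three steps of the copy main memory method can be sketched as below. This is an illustrative model only: the `Cpu` type, the list-based memory and the reserved `save_area` address are assumptions, and the source does not say where in memory the register contents are placed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cpu:
    """Hypothetical CPU model: a register file and a main memory."""
    registers: List[int]
    memory: List[int]


def copy_main_memory(ex: Cpu, sb: Cpu, save_area: int) -> None:
    """Re-integrate SB from EX by way of main memory.

    save_area is an assumed reserved address range large enough to hold
    the register contents without overwriting application data.
    """
    n = len(ex.registers)
    # Step 1: copy main memory (EX) to main memory (SB).
    sb.memory[:] = ex.memory
    # Step 2: write the EX registers into both main memories
    # (the first main memory operation on the state: a write).
    for mem in (ex.memory, sb.memory):
        mem[save_area:save_area + n] = ex.registers
    # Step 3: synchronous restart; each CPU reads the saved state back
    # from its own memory (the second main memory operation: a read).
    for cpu in (ex, sb):
        cpu.registers = list(cpu.memory[save_area:save_area + n])
```

The sketch makes the cost visible: the register state never moves directly between the CPU's, so it passes through main memory twice before the SB CPU holds it.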
FIG. 2 illustrates the second known re-integration method, copy instruction execution results. This method copies the results of all instructions that execute in the EX CPU 1 to the SB CPU 2. In this figure, EX CPU 1 is the current processor and SB CPU 2 is the faulty processor. Instruction pipelines 15 and 26 represent the basic functions performed in each CPU, respectively. Stages of a typical pipelined processor include: fetch, decode, execute, memory access and writeback. Writeback unit 152 of the current CPU transfers the results of each executed instruction over update bus 31 to writeback unit 262 of the faulty CPU. Data from the registers and main memory of EX CPU 1 are also transferred through the writeback units of each processor. This method requires extra hardware in the writeback unit of each processor in order to transfer all of the required data.
In the copy instruction execution results method, the microinstruction execution unit in the faulty CPU receives only an address to its control memory from the current CPU. Consequently, the microprogram in both CPU's must be the same. This means that the faulty CPU is forced to follow the current CPU regardless of the contents in the faulty CPU's contr
Burns Doane Swecker & Mathis L.L.P.
Ray Gopal C.
Telefonaktiebolaget LM Ericsson (publ)