Reexamination Certificate
1998-04-30
2001-02-13
Le, Dieu-Minh T. (Department: 2785)
Error detection/correction and fault detection/recovery
Data processing system error or fault handling
Reliability and availability
C714S011000, C714S013000
active
06189112
ABSTRACT:
FIELD OF THE INVENTION
This invention relates to computer systems, and particularly to multiprocessor systems that may implement our central processor (CP) sparing in a manner that is transparent to the user.
BACKGROUND OF THE INVENTION
Various computer manufacturers have an interest in high-availability systems. Typically, these systems implement a hardware error recovery mechanism to recover automatically and transparently from most transient errors. However, this error recovery will not be successful in most cases of solid, or non-transient, errors. Various mechanisms developed within IBM, such as Processor Availability Facility (PAF), Concurrent CP Sparing, and System Assist Processor (SAP) Reassignment, provide for the recovery of a failed processor's work on a different processor. All of these prior mechanisms have limitations.
Note that Amdahl has used the term “dynamic” in conjunction with their CP Sparing. However, to the best of our knowledge, their implementation is more analogous to a combination of the IBM Processor Availability Facility (PAF) and IBM's Concurrent CP Sparing, as currently implemented on the IBM 9672 G4, than to what is described here as transparent processor sparing. (IBM and S/390 are trademarks of International Business Machines Corporation.)
IBM's S/390 division, Hitachi, and Fujitsu (Amdahl) are the companies currently very active in this arena, but other competitors who may attempt to use other kinds of processors, such as HP and Intel, may be interested in employing our development if they attempt to produce mainframe-class systems. When a CP in a multiprocessor system encounters an error and enters a checkstop state, it is very desirable not to lose the work being done on that processor but instead to move that work to another processor that is still operating in the system. In an S/390 system, several methods have previously been used to attempt to solve this problem:
Processor Availability Facility (PAF) moves the S/390 architected state of the failed processor to another currently operating (on-line) processor in the system with the help of the Operating System (OS). However, it has several major limitations: 1) since the mechanism uses the OS to perform the function, the customer is aware that the incident occurred; 2) if the CP happened to be executing in millimode at the time of the checkstop, PAF cannot be invoked, since PAF works only at the S/390 architected state and not at the micro-architected state, which is a capability of G4 type S/390 systems (see e.g. U.S. Pat. No. 5,584,617); and 3) the customer has still lost the use of one of his CPs.
Concurrent CP Sparing, as currently implemented on the IBM 9672 G4 models, uses a spare processor so that the customer does not lose access to one of his CPs when a checkstop occurs. It is used in conjunction with PAF. However, the customer is fully aware that a processor had a problem, and customer intervention (VARY a CP online) is required in some environments. It also may not work in some Logical Partition (LPAR) environments where certain processors are dedicated to certain partitions. Finally, it relies on PAF for the application recovery, and PAF will not be successful if the CP checkstop occurred while the processor was executing in millimode.
Although not directly related to preserving CP function, IBM's System Assist Processor (SAP) Reassignment, as currently implemented on the IBM 9672 G4 models, uses a spare processor to take over when a System Assist Processor (SAP) encounters an error. This mechanism cannot be used for normal, non-SAP, CPs.
To summarize, the mechanisms described above work well as a whole but have the following limitations:
They do not work if a normal CP (non-SAP) was executing in millimode at the time of the failure.
All the above solutions are visible to the customer who then may be concerned that his hardware is “unreliable”.
Concurrent CP Sparing may not work in certain LPAR environments (e.g. dedicated uni-processor environments).
They will not work for uni-processor configurations even if a spare CP is available.
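The limitations listed above can be restated as a small decision sketch. The Python fragment below is an illustrative model only and is not part of any IBM facility: the `Failure` record and the function `paf_based_recovery_possible` are hypothetical names, and the conditions encode only what the text above states about PAF-based recovery (including Concurrent CP Sparing, which depends on PAF).

```python
from dataclasses import dataclass


@dataclass
class Failure:
    """Hypothetical description of a CP checkstop, for illustration only."""
    in_millimode: bool      # CP was executing in millimode when it checkstopped
    other_online_cps: int   # operating (on-line) CPs left in the configuration


def paf_based_recovery_possible(f: Failure) -> bool:
    """Return True if PAF-style recovery could apply, per the limitations above."""
    if f.in_millimode:
        # PAF works only at the S/390 architected state,
        # not at the micro-architected state.
        return False
    if f.other_online_cps == 0:
        # Uni-processor configuration: no on-line CP can receive the work,
        # even if a spare CP is available.
        return False
    return True
```

Even when this check passes, the recovery remains visible to the customer, which is the remaining limitation the sketch does not model.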
SUMMARY OF THE INVENTION
Our invention provides a mechanism by which the micro-architected state of a checkstopped processor can be transferred to a spare processor in the system. The transfer is accomplished by the system using a hardware instruction built into the processor that is usable only by millicode. In addition, the transfer is initiated and managed by Licensed Internal Code (LIC) sequences. This code runs both on an external Service Element (SE) and as millicode on the processors themselves.
It will be recognized that we have provided a solution whereby the action of the system is completely transparent to the Operating System and to the users of the system. In fact, they are not even aware that a processor had a non-recoverable error.
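The idea of moving a checkstopped processor's state onto a spare can be pictured with a toy model. The Python sketch below is not the patented implementation: the real transfer uses a millicode-only hardware instruction and LIC sequences on the SE, whose details are not given in this section. The `Processor` class, its fields, and `transfer_to_spare` are hypothetical, and the handover of a logical processor identity is an assumption made here to illustrate why the Operating System would see no change.

```python
import copy
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Processor:
    """Toy model of a CP; every field is an illustrative assumption."""
    physical_id: int
    logical_id: Optional[int] = None   # None means an unassigned spare
    checkstopped: bool = False
    # Stand-in for the micro-architected state; the real contents
    # are not described in the text.
    micro_state: Dict[str, int] = field(default_factory=dict)


def transfer_to_spare(failed: Processor, spare: Processor) -> None:
    """Copy the failed CP's state to the spare and hand over its logical identity."""
    if not failed.checkstopped:
        raise ValueError("source processor is not checkstopped")
    if spare.logical_id is not None or spare.checkstopped:
        raise ValueError("target is not an available spare")
    spare.micro_state = copy.deepcopy(failed.micro_state)
    spare.logical_id = failed.logical_id   # assumed: OS keeps seeing the same logical CP
    failed.logical_id = None               # failed CP leaves the configuration


# Example: logical CP 0 fails on physical CP 0 and resumes on physical CP 7.
cp0 = Processor(physical_id=0, logical_id=0, checkstopped=True,
                micro_state={"psw": 0x1234})
cp7 = Processor(physical_id=7)
transfer_to_spare(cp0, cp7)
```

In this model the work resumes on a different physical processor under the same logical identity, which mirrors the claim that neither the Operating System nor the users notice the failure.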
REFERENCES:
patent: 4710926 (1987-12-01), Brown et al.
patent: 4823256 (1989-04-01), Bishop et al.
patent: 4852092 (1989-07-01), Makita
patent: 5159597 (1992-10-01), Monahan et al.
patent: 5233618 (1993-08-01), Glider et al.
patent: 5345567 (1994-09-01), Hayden et al.
patent: 5428779 (1995-06-01), Allegrucci et al.
patent: 5504859 (1996-04-01), Gustafson et al.
patent: 5584617 (1996-12-01), Houser
patent: 5694617 (1997-12-01), Webb et al.
patent: 5802359 (1998-09-01), Webb et al.
“A Local Sparing Design Methodology For Fault Tolerant Multiprocessors” by Dutt et al., Computer Math. Applications (UK), vol. 34, No. 11, Dec. 1997, p. 25-50.
“Organizational Redundancy For A Parallel Processor Machine” by Brantley et al., IBM Technical Disclosure Bulletin, vol. 28, No. 1, Jun. 1985, p. 417-418.
“A Fault-Tolerant Multi-Transputer Architecture” by Kumar et al., Microprocessors and Microsystems, vol. 17, No. 1, Jan./Feb. 1993, p. 75-81.
“On Reconfiguration Latency In Fault-Tolerant Systems” by Kim et al., 1995 IEEE Aerospace Applications Conference Proceedings, vol. 1, Feb. 4-11, 1995, Snowmass at Aspen, CO, IEEE 95TH8043 p. 287-301.
“1994 IEEE International Conference on Wafer Scale Integration” by Lea et al., Proceedings of the 6th Annual Int. Conference on Wafer Scale Integration, 1994, p. 401.
Murray Robert E.
Slegel Timothy John
Augspurger Lynn L.
International Business Machines Corporation
Le Dieu-Minh T.