Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability
Reexamination Certificate
2000-04-28
2003-07-15
Beausoliel, Robert (Department: 2184)
C714S031000
active
06594785
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
This invention relates generally to an improved system and method for performing fault recovery within a Symmetrical Multi-Processor (SMP) system having multiple processing partitions; and more particularly, relates to a system and method for isolating and handling faults within a failing partition in a manner that prevents the fault from creating a failure in a second, non-failing partition that shares at least one main memory segment with the failing partition.
2. Description of the Prior Art
Data processing systems are becoming increasingly complex. Some systems, such as Symmetric Multi-Processor (SMP) computer systems, couple two or more Instruction Processors (IPs) and multiple Input/Output (I/O) Modules to shared memory. This allows the multiple IPs to operate simultaneously on the same task, and also allows multiple tasks to be performed at the same time to increase system throughput.
As the number of units coupled to a shared memory increases, more demands are placed on the memory and memory latency increases. To address this problem, high-speed cache memory systems are often coupled to one or more of the IPs for storing data signals that are copied from main memory. These cache memories are generally capable of processing requests faster than the main memory while also serving to reduce the number of requests that the main memory must handle. This increases system throughput.
Problems arise when one or more of the system's processors, whether instruction processors or I/O processors (hereafter referred to as processors and I/Os, or processor units and I/O units), has an error that is capable of corrupting an area of the main memory or any other memory that is or may be shared with other still-operating processors or I/Os. Losing the entire shared memory area for all the processors when only one or a small number are failing or involved in a failure is detrimental to the steady-state performance and overall throughput of the computer system. Accordingly, addressing this concern is a priority in computer systems where continuous or maximum throughput is a requirement.
The system for which the invention was developed, and which is used in the preferred embodiment, is a Symmetrical Multi-Processor (SMP) System (sometimes called a Cellular Multi-Processing (CMP) system) that is capable of being partitioned into multiple, independent data processing systems. That is, the hardware of the System may be sub-divided into multiple processing partitions. Each of the partitions includes predetermined processors, processor caches, peripheral devices, and portions of the main memory associated with or dedicated to the partition. A dedicated Operating System (OS) controls the hardware associated with the partition. Hardware interfaces are configured appropriately within the system to ensure that messages and data are only passed between the processors and peripheral devices within the same partition. Processing occurs within a partition relatively independently of processing that is being performed in any other partition. Communication between partitions may occur using shared address ranges within the main memory. The specific mechanisms used to accomplish this communication are described in detail in the U.S. Patent Application entitled “Computer System and Method for Operating Multiple Operating Systems in Different Partitions of the Computer System and for Allowing the Different Partitions to Communicate with One Another Through Shared Memory”, referenced above.
By assigning a shared address range to multiple partitions of a data processing system, processors within different partitions may communicate efficiently. This is desirable when multiple partitions are performing related tasks. Alternative mechanisms of communication involve messages sent through input/output devices, and do not provide the throughput that a shared-memory scheme offers. However, utilizing shared memory presents unique problems related to error recovery. If a unit within a first partition fails such that main memory data that is shared between the first partition and a second partition is corrupted, the second (non-failing) partition may also experience a fault. This makes the entire data processing system less robust.
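The exposure described above can be illustrated with a minimal Python sketch. It is not taken from the patent: the names `AddressRange`, `Partition`, and `partitions_exposed`, and the example addresses, are hypothetical and serve only to show that a partition is at risk exactly when it shares an address range with the failing partition.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AddressRange:
    """Hypothetical half-open address range [start, end) in main memory."""
    start: int
    end: int

    def overlaps(self, other: "AddressRange") -> bool:
        # Two half-open ranges overlap when each starts before the other ends.
        return self.start < other.end and other.start < self.end


@dataclass
class Partition:
    """Hypothetical partition: a name plus the address ranges it may access."""
    name: str
    ranges: list = field(default_factory=list)


def partitions_exposed(failing: Partition, others: list) -> list:
    """Names of partitions sharing at least one range with the failing one."""
    exposed = []
    for p in others:
        if any(a.overlaps(b) for a in failing.ranges for b in p.ranges):
            exposed.append(p.name)
    return exposed


# Assumed layout: P0 and P1 communicate through one shared range; P2 is isolated.
shared = AddressRange(0x8000, 0x9000)
p0 = Partition("P0", [AddressRange(0x0000, 0x4000), shared])
p1 = Partition("P1", [AddressRange(0x4000, 0x8000), shared])
p2 = Partition("P2", [AddressRange(0x9000, 0xC000)])

exposed = partitions_exposed(p0, [p1, p2])
```

In this assumed layout a fault in P0 puts only P1 at risk, because P2 has no range in common with P0.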
Another complication associated with the system of the preferred embodiment involves the use of write-back, versus store-through, caches. When write-back caches are employed, a copy of any data that is updated within a processor cache is not immediately stored back to main memory. The only copy of the updated data resides within the cache until the processor flushes the cached memory segment back to the main memory. Therefore, a failure within a partition may cause the only copy of valid memory data to be lost. To minimize this risk, it is important to allow all memory operations initiated by a partition prior to the occurrence of a fault to complete, even though subsequent operations will be abandoned to prevent corruption of system data.
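The difference between the two cache policies can be sketched with a toy model. This is an illustrative assumption, not the cache design of the preferred embodiment: the `Cache` class, its methods, and the `value_after_fault` helper are hypothetical, and main memory is represented by a plain dictionary.

```python
class Cache:
    """Toy single-level cache in front of a dict that stands in for main memory."""

    def __init__(self, memory: dict, write_back: bool):
        self.memory = memory
        self.write_back = write_back
        self.lines = {}     # address -> value currently held in the cache
        self.dirty = set()  # addresses updated in the cache but not yet in memory

    def write(self, addr: int, value) -> None:
        self.lines[addr] = value
        if self.write_back:
            # Write-back: memory is updated only on a later flush.
            self.dirty.add(addr)
        else:
            # Store-through: memory is updated on every write.
            self.memory[addr] = value

    def flush(self) -> None:
        """Copy every dirty line back to main memory."""
        for addr in self.dirty:
            self.memory[addr] = self.lines[addr]
        self.dirty.clear()

    def fail(self) -> None:
        """Model a fault that destroys the cache contents."""
        self.lines.clear()
        self.dirty.clear()


def value_after_fault(write_back: bool, flush_first: bool):
    """Write 'new' over 'old', optionally flush, then fail; return memory's copy."""
    memory = {0x100: "old"}
    cache = Cache(memory, write_back)
    cache.write(0x100, "new")
    if flush_first:
        cache.flush()
    cache.fail()
    return memory[0x100]
```

Under this model, `value_after_fault(True, False)` leaves main memory holding the stale value, which is the loss the paragraph above warns about, while flushing before the fault or using store-through preserves the update.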
One way to handle errors that affect memory data residing within a range of main memory shared between multiple partitions involves designating all shared data as unusable by both partitions. Although this recovery mechanism is relatively straight-forward to implement, it may result in the loss of a memory range that is critical to applications running on the non-failing partition. This approach does not provide a resilient error recovery mechanism.
Another mechanism for handling this problem involves allowing main memory to process memory requests following the issuance of a fault notification. According to this method, main memory determines, based on the receipt of an error indication, which memory requests should be serviced and which should be discarded. Because of latency between the detection of errors within the various units of the partition and the receipt of an error indication at the main memory, it may be difficult for the memory logic to determine which memory requests to process and which to discard. This may ultimately result in corruption of memory data. Moreover, by the time requests have been received by the memory, requests from the failing unit have already entered resources such as memory queues that are shared between the failing and non-failing partitions. This makes the process of determining which requests to process and which to discard more complex.
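The latency problem can be made concrete with a small timing sketch. It rests on simplifying assumptions that are not in the source: requests are tagged with an issue time, transit time to memory is ignored, and the hypothetical function `wrongly_serviced` treats every request issued between the fault and the arrival of the error indication as one that memory services by mistake.

```python
def wrongly_serviced(requests, fault_time, notify_latency):
    """Return labels of requests issued after the fault but before memory
    learns of it.

    requests: list of (issue_time, label) pairs (hypothetical format).
    fault_time: time at which the unit actually failed.
    notify_latency: delay before the error indication reaches main memory.
    """
    memory_learns_at = fault_time + notify_latency
    return [
        label
        for issue_time, label in requests
        if fault_time <= issue_time < memory_learns_at
    ]


# Assumed example timeline: the fault occurs at t=4.
timeline = [(1, "r1"), (3, "r2"), (5, "r3"), (8, "r4")]

slipped_through = wrongly_serviced(timeline, fault_time=4, notify_latency=3)
no_latency = wrongly_serviced(timeline, fault_time=4, notify_latency=0)
```

With a notification latency of 3, request `r3` is issued after the fault yet reaches memory before the error indication does; with zero latency nothing slips through. The window grows with the latency, which is why filtering at the memory is unreliable.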
What is needed, therefore, is a system and method for recovering from an error within a first partition without affecting a second partition that shares main memory segments with the failing partition. The system and method should isolate errors as close to the failure as possible so that requests that are unaffected by the fault may be processed while requests made after the failure indication is received may be discarded.
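The stated requirement can be sketched as a gate placed at the failing partition's interface rather than at the memory. This is a minimal illustration under assumed names (`PartitionGate`, `submit`, `signal_fault`, `drain`), not the mechanism claimed by the patent: requests accepted before the fault are allowed to complete, and requests made after the fault indication are discarded at the source.

```python
from collections import deque


class PartitionGate:
    """Hypothetical gate between one partition and shared main memory."""

    def __init__(self, name: str):
        self.name = name
        self.faulted = False
        self.pending = deque()  # requests accepted before any fault
        self.discarded = []     # requests rejected after the fault

    def submit(self, request) -> bool:
        """Accept a (address, value) write unless the partition has faulted."""
        if self.faulted:
            self.discarded.append(request)
            return False
        self.pending.append(request)
        return True

    def signal_fault(self) -> None:
        """Close the gate; already-accepted requests remain pending."""
        self.faulted = True

    def drain(self, memory: dict) -> None:
        """Complete every request that was accepted before the fault."""
        while self.pending:
            addr, value = self.pending.popleft()
            memory[addr] = value


memory = {}
gate = PartitionGate("P0")
gate.submit((0x10, "a"))
gate.submit((0x14, "b"))
gate.signal_fault()
accepted = gate.submit((0x18, "suspect"))
gate.drain(memory)
```

Because the decision is made where the fault is detected, the post-fault write never enters queues shared with other partitions, and the two earlier writes still reach memory.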
SUMMARY OF THE INVENTION
In general, this invention provides an improved Symmetrical Multi-Processor (SMP) data processing system and is particularly related to SMP systems having improved fault-handling capabilities. The invention is particularly geared toward providing a fault handling system for a multi-partition data processing system having multiple partitions that communicate via a shared main memory. Different forms of fault can call for variation in the process of fault handling and recovery in such systems. Elements of the invention provide for variable recovery with the goals of reducing or eliminating corruption of memory data and providing resilient error recovery. The kinds of errors or faults tracked by this system can be thought of as critical errors because they indicate unreliability of the system having the fault.
The present invention is particularly applicable to a hierarchical, multi-level, memory system that keeps track of all cache lines of data in a main memory, whether the owner of a cache line is in a local processor's cache away from the main memory or not, and whether the main memory is distributed across multiple Main Storage Units, each subdivided into “memory clusters”, as in th
Bauman Mitchell A.
DePenning James L.
Fellenser Frederick G.
Gilbertson Roger L.
Haupt Michael L.
Atlass Michael B.
Beausoliel Robert
Chiu Gabriel L.
Starr Mark T.
Unisys Corporation