Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability
Reexamination Certificate
1999-09-15
2003-01-07
Beausoleil, Robert (Department: 2184)
C714S042000
active
06505306
ABSTRACT:
FIELD OF THE INVENTION
The invention is generally related to data processing systems such as computers and like electronic devices, and more particularly, to error detection and correction in a memory array implemented in a data processing system.
BACKGROUND OF THE INVENTION
Ensuring the integrity of data processed by a data processing system such as a computer or like electronic device is critical for the reliable operation of such a system. Data integrity is of particular concern, for example, in fault tolerant applications such as servers, databases, scientific computers, and the like, where any errors whatsoever could jeopardize the accuracy of complex operations and/or cause system crashes that affect large numbers of users.
Data integrity issues are a concern, for example, for many solid state memory arrays such as those used as the main working storage repository for a data processing system. Solid state memory arrays are typically implemented using multiple integrated circuit memory devices such as static or dynamic random access memory (SRAM or DRAM) devices. Such devices may be subject to a number of errors throughout their lifetimes, including what are referred to as “hard” and “soft” errors.
A hard error is generally a permanent failure of all or a portion of a memory device, typically due to a latent defect during manufacturing or some electrical disturbance that occurs during the operation of the device. A hard error, for example, may affect a single memory cell, a row or column of memory cells, an input/output port, or even an entire device. A soft error is generally a transient alteration in the state of data stored in a memory cell, often due to natural effects such as cosmic rays or alpha particles, and can typically be corrected by rewriting the affected cell with correct data. Despite significant improvements in circuit fabrication technologies, however, no memory device is completely immune from errors, and as such, significant development efforts have been directed at handling such errors in a manner that ensures the continued integrity of the data stored in the memory array.
For example, complex error correction code (ECC) algorithms have been developed to address some of the errors that may arise during the operation of a memory array. Data is typically stored in a memory array in binary form including a plurality of bits arranged together to form logical “words” of data. Most ECC algorithms typically address the situation where a single bit in a word is faulty, an error condition known as a “single bit error”. To do so, most ECC algorithms store a separate error correction code along with a word of data, and a complex mathematical algorithm is used to both detect and correct for any single bit error in the word.
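As an illustration of how a stored code can both detect and correct a single bit error, the following is a minimal sketch using the textbook Hamming(7,4) code. The text above does not name a particular ECC algorithm, so the choice of Hamming(7,4) and the function names `hamming74_encode`, `hamming74_correct`, and `hamming74_data` are assumptions made purely for illustration; memory arrays generally use wider codes.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit codeword.

    Illustrative only: Hamming(7,4) is an assumed stand-in for the
    unspecified ECC algorithm discussed in the text.
    """
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    # Codeword layout, positions 1..7: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_correct(codeword):
    """Return (corrected_codeword, error_position).

    error_position is the 1-based position of the flipped bit,
    or 0 if the syndrome indicates no error.
    """
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3  # the syndrome spells out the bad position
    if position:
        c[position - 1] ^= 1  # flip the faulty bit back
    return c, position


def hamming74_data(codeword):
    """Extract the 4 data bits from a 7-bit codeword."""
    return [codeword[2], codeword[4], codeword[5], codeword[6]]
```

Flipping any one of the seven stored bits and passing the result through `hamming74_correct` recovers the original word. Flipping two bits does not, which is the limitation on multi-bit errors discussed next.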
ECC algorithms typically cannot address the situation where multiple bits in a word are faulty. However, since single bit errors comprise the vast majority of all errors experienced in a memory array, ECC algorithms do a great deal to improve the data integrity of ECC-capable data processing systems. Moreover, to minimize the likelihood of multi-bit errors (also referred to as unrecoverable errors), many memory arrays arrange memory devices such that each device provides no more than one bit in any given word in the memory address space. Consequently, failure of any given device, or portion thereof, will only cause an unrecoverable error if an error is also present in another device that provides another bit of the same word.
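The effect of this one-bit-per-device arrangement can be sketched with a small model. The representation below (a list of devices, each holding one bit per address) and the helper names `assemble_words`, `fail_device`, and `bit_errors_per_word` are hypothetical, chosen only to show that a whole-device failure damages at most one bit of any word, while failures in two devices can damage two bits of the same word.

```python
def assemble_words(devices):
    """Build words from bit-sliced devices.

    devices[i][addr] is the single bit that device i contributes to the
    word at address addr (an assumed, simplified layout).
    """
    return [list(bits) for bits in zip(*devices)]


def fail_device(devices, index):
    """Return a copy of the array in which every bit of one device is inverted.

    Inverting every bit is a worst-case stand-in for a total device failure.
    """
    return [
        [bit ^ 1 for bit in dev] if i == index else list(dev)
        for i, dev in enumerate(devices)
    ]


def bit_errors_per_word(good_devices, bad_devices):
    """Count, for each address, how many bits of the word differ."""
    good_words = assemble_words(good_devices)
    bad_words = assemble_words(bad_devices)
    return [
        sum(g != b for g, b in zip(good_word, bad_word))
        for good_word, bad_word in zip(good_words, bad_words)
    ]
```

With four devices and three addresses, failing one device yields exactly one bad bit in each word, which single bit correction can handle. Failing a second device raises the count to two per word.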
Additional integrity protection may be available through the use of a redundant memory array, in which a portion of the data width in the memory array is reserved for use whenever an error is detected in another portion of the memory array. In systems in which no memory device supplies more than one bit of data for any given word, a failure of a particular device results in a failure in one bit of a word, and a process known as redundant bit steering (RBS) is used to redirect, or “steer” dataflow from the failed bit to a redundant bit allocated in the reserved space of the redundant memory array.
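Redundant bit steering as described above can be modeled, under simplifying assumptions, as a mapping from each logical bit position to the physical device that supplies it. The class `SteeredArray`, its single spare device, and its method names are hypothetical; this is a sketch of the general idea, not of any particular memory controller.

```python
class SteeredArray:
    """Toy model of a bit-sliced memory array with one spare device."""

    def __init__(self, word_bits, depth):
        # One device per logical bit, plus one spare at index word_bits.
        self.devices = [[0] * depth for _ in range(word_bits + 1)]
        self.spare = word_bits
        # steer[logical_bit] -> index of the physical device supplying it
        self.steer = list(range(word_bits))

    def store(self, addr, bits):
        """Write each logical bit to whichever device it is steered to."""
        for logical, bit in enumerate(bits):
            self.devices[self.steer[logical]][addr] = bit

    def fetch(self, addr):
        """Read each logical bit from whichever device it is steered to."""
        return [self.devices[dev][addr] for dev in self.steer]

    def steer_to_spare(self, failed_logical_bit):
        """Redirect dataflow for one logical bit to the spare device."""
        self.steer[failed_logical_bit] = self.spare
```

In this model, a fetch issued immediately after `steer_to_spare` returns whatever the spare happened to hold at that address, which is why the redundant device must be initialized with the data from the failed device.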
Often, the reserved space in a redundant memory array is allocated on a separate, dedicated memory device, although the reserved space could be allocated in existing devices as well. Regardless, when all or a portion of the addressable space of a device is determined to be faulty, the process of redirecting dataflow from the failed bit to a redundant bit is referred to as “replacing” the failed device with a redundant device, irrespective of the fact that other memory accesses allocated to the failed device may continue to be processed.
Replacing a failed device with a redundant device necessarily requires that the redundant device be initialized with the data from the failed device. The most straightforward manner of doing so would be to simply prohibit access to the memory array, copy the affected data over from the failed device to the redundant device, and then switch over to the redundant device for all future accesses. However, in most fault tolerant applications, it is not possible to prohibit accesses to the memory array for any appreciable amount of time. Consequently, a redundant memory array typically must be capable of handling non-initialization operations concurrently with initialization of a redundant memory device.
For example, one manner of initializing a redundant device while maintaining the availability of the memory array is to simply switch over to the redundant device and allow ECC logic to correct any single bit errors in the redundant bit supplied by the device. Over time, stores to the redundant device would fill the device with correct data. However, without initialization, the initial state of the data in the redundant device at the time of switchover cannot be known, and as such, statistically 50% of all accesses involving the redundant device will require the redundant bit to be corrected. Reliability then becomes a concern with this approach, since for any single-bit error in another memory device, there is roughly a 50% chance that an error in the redundant bit will also occur, resulting in an unrecoverable multi-bit error.
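The 50% figure can be checked by enumerating the possible cases. The sketch below assumes that the expected bit and the uninitialized redundant bit are each equally likely to be 0 or 1 and are independent of one another; that assumption, and the function name `mismatch_probability`, are introduced here for illustration only.

```python
from fractions import Fraction
from itertools import product


def mismatch_probability():
    """Probability that an uninitialized redundant bit differs from the
    value it should hold, assuming both are uniform and independent."""
    cases = list(product((0, 1), repeat=2))  # (expected_bit, spare_bit)
    wrong = sum(1 for expected, spare in cases if expected != spare)
    return Fraction(wrong, len(cases))
```

Two of the four equally likely cases mismatch, giving 1/2. Under the same assumptions, a word that already has a single bit error in another device acquires a second error in the redundant bit with that same probability of 1/2.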
Another approach is to attempt to copy data over from the failed device to the redundant device concurrently with the processing of regular store and fetch operations submitted to the memory array, a process known as “cleaning” the redundant device. Data is typically copied over by sequentially fetching segments of data allocated to a failed memory device, passing the data through normal ECC logic, and storing the corrected data segments back into the redundant device. During such operations, however, typically the hardware that substitutes the redundant device for the failed device is controlled such that regular fetch or store operations directed to an uncleaned area of the failed device are directed to the failed device, while operations directed to the cleaned area are directed to the redundant device. In practice, however, such operations are problematic to implement given that the boundary of the cleaned and uncleaned areas is constantly moving, making it difficult to determine whether an access is directed to a cleaned or uncleaned area of a device. These difficulties increase the risk that the failed device will be utilized for accesses to the cleaned area, or that the redundant device will be utilized for accesses to the uncleaned area, introducing potential data integrity concerns. Moreover, the logic required to properly control the dataflow between the failed and redundant devices is more complex, requiring additional hardware and increasing the cost of a memory controller design.
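The boundary-tracking behavior of this conventional cleaning approach can be sketched as follows. This models only the approach described in the preceding paragraph, not the invention. The class `CleaningModel`, the `corrected_bits` list standing in for data that has passed through ECC logic, and the method names are all hypothetical.

```python
class CleaningModel:
    """Toy model of cleaning one bit column with a moving boundary."""

    def __init__(self, corrected_bits):
        # Stand-in for the values ECC logic would reconstruct per address.
        self.corrected_bits = list(corrected_bits)
        # The redundant device starts out uninitialized.
        self.spare = [None] * len(self.corrected_bits)
        # Addresses below the boundary have been cleaned.
        self.boundary = 0

    def clean_step(self):
        """Clean the next address, if any, and return the new boundary."""
        if self.boundary < len(self.spare):
            addr = self.boundary
            # Stand-in for: fetch segment, pass through ECC, store to spare.
            self.spare[addr] = self.corrected_bits[addr]
            self.boundary += 1
        return self.boundary

    def source_for(self, addr):
        """Select which device should service an access to addr."""
        return "redundant" if addr < self.boundary else "failed"
```

Every regular access must be compared against `boundary`, and `boundary` changes with every cleaning step. A comparison made against a stale value selects the wrong device, which is the data integrity risk noted above.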
Therefore, a significant need continues to exist in the art for an improved manner of initializing a redundant device in a redundant memory array, and in particular, for an improved manner of initializing a redundant device which provides fast and efficient initialization while maintaining data integrity.
SUMMARY OF THE INVENTION
The invention addresses these and other problems associated with the prior art b
Blackmon Herman Lee
Drehmel Robert Allen
Haselhorst Kent Harold
Marcella James Anthony
Beausoleil Robert
International Business Machines - Corporation
Wilson Yolanda
Wood Herron & Evans