Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability
Reexamination Certificate
1999-07-02
2003-06-03
Beausoliel, Robert (Department: 2184)
C714S005110, C714S701000, C714S761000, C714S762000, C714S766000
active
06574746
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
This invention relates to memory within computer systems, and more specifically, to error correction systems for detecting and correcting errors that may be present in data stored or transmitted to and from system memory.
2. Description of the Relevant Art
For data transmissions occurring within a given computer system, there is always a finite chance that the transmitted data is in error. This is true when the source of the transmitted data is a dynamic random access memory (DRAM). The majority of errors that occur in a DRAM chip are soft errors, which are correctable. Hard errors may occur as well, and some hard errors may be correctable, but their occurrence is typically less frequent than the occurrence of soft errors. Two primary sources of soft errors are alpha particles and cosmic rays. Since a DRAM stores a given bit via a charge, an alpha particle or cosmic ray can alter this charge, thereby changing the contents of a given memory cell.
As the amount of main memory in computer systems has continued to increase, the frequency of soft errors has increased correspondingly. Soft errors, if left uncorrected, can have adverse effects on system performance, corrupting data and even causing system crashes. One measure of the possibility of such failure is referred to as Mean Time Between Failures (MTBF). Uncorrected soft errors can reduce the MTBF of a given computer system.
In order to counter the presence of errors, many computers employ error correction circuitry. Such circuitry is used to implement error correction codes (ECCs), which are used to detect and correct errors within a computer system. There are many different types of ECCs. Some of the more commonly used codes are referred to as Hamming codes, although many others have been developed. In some error correction systems, a bit pattern, such as one representing an ASCII character, is recoded with redundant bits, more commonly referred to as check bits. Groups of check bits are referred to as check words, and each data block stored in a DRAM may be protected by at least one check word.
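As an illustration of how check bits can locate an error, the following is a minimal sketch of a Hamming(7,4) code, chosen here only because it is the smallest common example. The function names and the 4-bit data width are assumptions made for illustration and are not taken from the patent, which does not specify a particular code.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits into 7 code bits (positions 1..7).

    Positions 1, 2 and 4 hold check bits; positions 3, 5, 6 and 7 hold data.
    """
    c = [0] * 8  # index 0 is unused so indices match bit positions
    c[3], c[5], c[6], c[7] = data_bits
    c[1] = c[3] ^ c[5] ^ c[7]  # covers positions 1, 3, 5, 7
    c[2] = c[3] ^ c[6] ^ c[7]  # covers positions 2, 3, 6, 7
    c[4] = c[5] ^ c[6] ^ c[7]  # covers positions 4, 5, 6, 7
    return c[1:]


def hamming74_syndrome(code_bits):
    """Return 0 for a valid code word, else the position of a single flipped bit."""
    c = [0] + list(code_bits)
    s1 = c[1] ^ c[3] ^ c[5] ^ c[7]
    s2 = c[2] ^ c[3] ^ c[6] ^ c[7]
    s4 = c[4] ^ c[5] ^ c[6] ^ c[7]
    return s1 + 2 * s2 + 4 * s4


def hamming74_correct(code_bits):
    """Correct at most one flipped bit; return (corrected bits, error position)."""
    fixed = list(code_bits)
    position = hamming74_syndrome(fixed)
    if position:
        fixed[position - 1] ^= 1
    return fixed, position


code = hamming74_encode([1, 0, 1, 1])
damaged = list(code)
damaged[4] ^= 1  # simulate a soft error at position 5
repaired, where = hamming74_correct(damaged)
print(code, damaged, repaired, where)
```

The three check bits here play the role of a very small check word: their recomputed values form a syndrome that names the failing position, which is why a single flipped bit can be corrected rather than merely detected.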
Parity is another element of many error correction codes. Even parity is defined as adding a check bit so that the total number of logic ones in a given bit pattern is even, while odd parity requires adding a check bit so that the total number of logic ones is odd. In a system where even parity is in use, the receipt of a word, including check bits, containing an odd number of logic ones automatically indicates the presence of an error in the data. Receipt of a word containing an even number of logic ones in an odd parity system will also indicate the presence of an error.
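The even and odd parity rules described above can be sketched in a few lines. The helper names `parity_bit` and `parity_ok` are hypothetical and serve only to illustrate the definition.

```python
def parity_bit(data_bits, even=True):
    """Return the check bit that gives the whole word even (or odd) parity."""
    makes_even = sum(data_bits) % 2  # 1 if the data holds an odd number of ones
    return makes_even if even else makes_even ^ 1


def parity_ok(word, even=True):
    """Check a received word (data bits plus check bit) against the parity rule."""
    total_is_even = sum(word) % 2 == 0
    return total_is_even == even


data = [1, 0, 1, 1]
even_word = data + [parity_bit(data, even=True)]
odd_word = data + [parity_bit(data, even=False)]

corrupted = list(even_word)
corrupted[0] ^= 1  # a single flipped bit breaks the parity rule

print(parity_ok(even_word), parity_ok(odd_word, even=False), parity_ok(corrupted))
```

A single parity bit only detects an odd number of flipped bits; it cannot say which bit failed, and two flips in the same word cancel out and go unnoticed.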
Many error correction schemes can typically correct only one error within a given data word. Some error correction schemes allow the detection of two errors, but these schemes are usually unable to unambiguously correct both of them. As previously mentioned, many soft errors in a DRAM are caused by cosmic rays or alpha particles. Alpha particles are localized phenomena, and in many cases, can alter the contents of multiple bits in the general area in which they occur. Similarly, cosmic rays, while not a localized phenomenon, can nevertheless bombard a semiconductor memory with protons and neutrons, randomly altering the stored bits. Since a number of error correction schemes assign physically adjacent check bits within the DRAM to a given check word, there is an increased possibility of uncorrectable multi-bit soft errors occurring within a given check word. Furthermore, data bits protected by a given check word may be altered in the same manner.
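To make the single-error limit concrete, the sketch below uses a 7-bit Hamming word extended with one overall even-parity bit, a common single-error-correct, double-error-detect arrangement. This specific code, the function names, and the sample word are assumptions for illustration; the patent does not name the scheme it has in mind. The classification assumes at most two flipped bits per word.

```python
from functools import reduce


def syndrome(bits7):
    """XOR of the 1-indexed positions holding a one; 0 for a valid Hamming(7,4) word."""
    return reduce(lambda a, b: a ^ b,
                  (pos for pos, bit in enumerate(bits7, start=1) if bit), 0)


def flip(word, *positions):
    """Return a copy of word with the given 1-indexed positions inverted."""
    out = list(word)
    for pos in positions:
        out[pos - 1] ^= 1
    return out


def classify(word8):
    """word8 is 7 Hamming bits followed by one overall even-parity bit."""
    overall_odd = sum(word8) % 2 == 1
    if overall_odd:
        return "single-bit error, correctable"
    if syndrome(word8[:7]) == 0:
        return "no error detected"
    return "double-bit error, detected but not correctable"


good = [0, 1, 1, 0, 0, 1, 1, 0]  # a valid extended code word
one_bad = flip(good, 5)
two_bad = flip(good, 3, 5)

# A decoder without the overall parity bit would trust the syndrome and
# flip the position it names, turning two errors into three.
miscorrected = flip(two_bad[:7], syndrome(two_bad[:7]))

print(classify(good), classify(one_bad), classify(two_bad), sep="\n")
```

With two flipped bits the syndrome still points at a position, but it is the wrong one, which is why a double error can be flagged yet not unambiguously repaired.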
The relationship between DRAM cell architecture and DRAM input/output (I/O) architecture may have an effect on the manner in which given check bits are assigned to check words. For example, in some DRAM chips, the cell layout will result in cells connected to data line D15 being physically adjacent to cells connected to data line D0, although these two bits are not logically adjacent. In other DRAM chips, D0 may be adjacent to D1, D1 adjacent to D2, and so on. Check bits on these data lines are often assigned to the same check word.
FIG. 1 illustrates one row of an example memory array within a DRAM, wherein check bits stored in adjacent locations are assigned to the same check words.
When certain phenomena occur, such as alpha particle radiation, multiple adjacent bits stored in a memory array can be altered, causing multi-bit errors. Multi-bit errors are generally more difficult to detect and correct than single-bit errors. A method to reduce the possibility of multi-bit soft errors degrading system operation would be desirable. It would be further desirable to make multi-bit soft errors appear as single-bit soft errors, thereby making the errors easier to correct.
SUMMARY OF THE INVENTION
The problems outlined above may in large part be solved by a system and method for error correction for improving multi-bit error protection in computer memory systems, in accordance with the present invention. In one embodiment, check bits forming a check word are stored in physically non-adjacent storage cells with respect to every other check bit of the given check word. Since there is a likelihood that soft and/or hard errors will cause physically adjacent cells to provide erroneous data, associating check bits with check words in this manner results in multi-bit errors appearing as single-bit errors to an error correction subsystem. Similarly, the likelihood of multi-bit errors occurring in the same check word may be reduced.
In one embodiment, a memory module includes a printed circuit board upon which a plurality of DRAM chips are mounted. Some of these DRAM chips are configured to store data words, while others store check bits associated with given data words. Each data word is protected by a number of check bits forming a check word. These check bits are generated according to a predetermined error correction scheme, such as a Hamming code. A group of check bits is referred to as a check word. The check bits are stored in DRAM chips in such a manner that each check bit of a given check word is stored in a physically non-adjacent memory cell with respect to every other check bit in the given check word. Typically, each check bit from a given DRAM chip will be assigned to a different check word.
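One way to picture such an assignment is sketched below. The chip count, the chip width, and the rule that bit position b of every chip feeds check word b are hypothetical choices made for illustration; the patent text here does not fix a particular mapping.

```python
def build_check_words(num_chips, bits_per_chip):
    """Assign each (chip, bit) cell to a check word.

    Hypothetical rule: bit position `bit` of every chip belongs to check
    word number `bit`, so each chip contributes one bit to each check word.
    """
    words = {w: [] for w in range(bits_per_chip)}
    for chip in range(num_chips):
        for bit in range(bits_per_chip):
            words[bit].append((chip, bit))
    return words


def shares_a_chip(cells):
    """True if two cells of the same check word sit in the same chip."""
    chips = [chip for chip, _ in cells]
    return len(chips) != len(set(chips))


check_words = build_check_words(num_chips=4, bits_per_chip=8)
for number, cells in check_words.items():
    print(number, cells, "shares a chip:", shares_a_chip(cells))
```

Because no check word draws two bits from the same chip under this rule, no two bits of a check word can be physically adjacent within a chip, which is the property the embodiment relies on.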
During a memory access, a data word is accessed, and check words associated with the accessed data word are received by an error correction subsystem. The error correction subsystem will then use the check words to check for the presence of an error, according to the predetermined error correction scheme. Since each of the check bits from a given DRAM chip is assigned to a different check word, multi-bit errors from a given DRAM chip will appear as a plurality of single-bit errors, which are generally easier to detect and correct. Furthermore, since check bits from a given DRAM are assigned to different check words, the likelihood of multiple errors occurring in the same check word may be reduced.
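The effect can be sketched by injecting a burst of adjacent bit errors into one chip and counting how many land in each check word. Both layout functions, the chip width, and the burst parameters are hypothetical and serve only to contrast an adjacent assignment with a spread-out one.

```python
from collections import Counter

BITS_PER_CHIP = 8  # hypothetical chip width


def adjacent_layout(chip, bit):
    """All check bits of one chip belong to the same check word."""
    return chip


def interleaved_layout(chip, bit):
    """Each check bit of a chip belongs to a different check word."""
    return bit


def errors_per_check_word(layout, chip, first_bit, burst_len):
    """Count the errors each check word sees when a run of adjacent bits flips."""
    if first_bit + burst_len > BITS_PER_CHIP:
        raise ValueError("burst runs past the end of the chip")
    hits = Counter()
    for bit in range(first_bit, first_bit + burst_len):
        hits[layout(chip, bit)] += 1
    return dict(hits)


# A three-bit burst in chip 2, starting at bit 3.
print(errors_per_check_word(adjacent_layout, 2, 3, 3))
print(errors_per_check_word(interleaved_layout, 2, 3, 3))
```

Under the adjacent layout the whole burst falls into one check word as a three-bit error, whereas under the interleaved layout three different check words each see a single-bit error that a single-error-correcting code can handle.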
Thus, in various embodiments, the system and method for improving multi-bit error protection in computer memory systems may advantageously reduce the possibility of multi-bit errors occurring in the same check word. Furthermore, since check bits stored in physically adjacent locations within a DRAM are assigned to different check words, multi-bit errors caused by errors in check bits stored in physically adjacent locations will appear as single-bit errors to the error correction subsystem. Since single-bit errors are generally easier to detect and correct, system reliability and data integrity may be advantageously enhanced.
Carrillo John
Fang Clement
Ko Han Y.
Singhal Ashok
Wong Tayung
Beausoliel Robert
Kivlin B. Noël
Maskulinski Michael
Meyertons Hood Kivlin Kowert & Goetzel P.C.
Sun Microsystems Inc.