Classification: Error detection/correction and fault detection/recovery – Pulse or data error handling – Memory testing
Type: Reexamination Certificate
Filed: 2000-11-20
Issued: 2004-09-28
Examiner: Decady, Albert (Department: 2133)
US Classification: C714S754000, C714S823000
Status: active
Patent number: 06799291
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Technical Field
The present invention relates in general to memory configurations for computing systems, and in particular to fault detection. More specifically, the present invention relates to a method and system for detecting a hard memory array failure in a memory system.
2. Description of the Related Art
Memory systems employed in conventional data processing systems, such as computer systems, typically include large arrays of physical memory cells that are utilized to store information in a binary manner. Generally in a conventional memory system, the memory cells are arranged in an array having a set number of rows and columns. Operatively, the rows are selected by row decoders that are typically located adjacent to the ends of the row lines. Each of the row lines is electrically connected to the row decoders so that the appropriate signals can be received and transmitted. The columns of the memory array are connected to input/output (I/O) devices, such as a read/write multiplexer. In the case of dynamic random access memories (DRAMs), the memory array columns are also connected to line precharging and sense amplifier circuits at the end of each column line to periodically refresh the memory cells.
There are two kinds of errors that can typically occur in a memory system. The first is called a repeatable or “hard” error. For example, a piece of hardware that is part of the memory system, e.g., an interconnect, is broken and will consistently return incorrect results. A cell within a memory array may be stuck so that it always returns “0,” for example, no matter what is written to it. Hard errors typically include loose memory module contacts, electrically blown chips, motherboard defects or any other physical problems. These types of errors are relatively easy to diagnose and correct because they are consistent and repeatable. The second kind of error is called a transient or “soft” error. This occurs when a memory cell reads back the wrong value once, but subsequently functions correctly. These error conditions are, understandably, much more difficult to diagnose and are, unfortunately, more common. Eventually, a soft error may repeat itself, but may surface only after a lengthy period of time. These types of errors result, for example, from memory devices in which some cells are marginally defective, from alpha particle strikes on the cell area, or from other problems not related directly to the memory. The good news, however, is that the majority of soft errors due to alpha particles result in single-bit errors.
To preclude corrupted data from use, encoding techniques have been developed and are employed in data processing systems to provide for detection and correction of errors. The simplest and most well-known error detection technique is the single-bit parity code. To implement a parity code, a single bit is appended to the end of the data word stored in memory. For even parity systems, the parity bit is assigned a value such that the total number of ones in the stored word, including the parity bit, is even. Conversely, for odd parity, the parity bit is assigned such that the total number of ones is odd. When the stored word is read, if one of the bits is erroneous, the total number of ones in the word changes, so the parity of the retrieved data no longer matches the stored parity bit. Thus, an error is detected by comparing the stored parity bit to a regenerated check bit calculated for the data word as it is retrieved from memory.
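As a rough illustration of this scheme (a minimal sketch, not code from the patent), the following C fragment appends an even-parity bit to an 8-bit word and regenerates it on read-back; a single flipped bit changes the regenerated value and is therefore detected. The word value and the simulated fault are arbitrary assumptions for the example.

#include <stdint.h>
#include <stdio.h>

/* Compute an even-parity bit for an 8-bit data word: the bit is chosen so
 * that the total number of ones (data bits plus parity bit) is even. */
static uint8_t even_parity(uint8_t word)
{
    uint8_t p = 0;
    for (int i = 0; i < 8; i++)
        p ^= (word >> i) & 1u;          /* XOR of all bits = 1 iff the count of ones is odd */
    return p;
}

int main(void)
{
    uint8_t stored        = 0x5A;               /* word written to memory            */
    uint8_t stored_parity = even_parity(stored);

    uint8_t read_back   = stored ^ 0x04;        /* simulate a single stuck/flipped bit */
    uint8_t regenerated = even_parity(read_back);

    if (regenerated != stored_parity)
        printf("parity mismatch: single-bit error detected\n");
    return 0;
}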
Although a single-bit parity code effectively detects single-bit read errors, the technique has limits. For example, if two bits are erroneous, the parity of the data remains the same as the stored parity bit because the total number of ones in the word is still odd or still even. In addition, even though an error may be detected, the single-bit parity code cannot determine which bit is erroneous, and therefore the failed memory cell cannot be identified or corrected.
To provide error correction and more effective error detection, various error correction codes were developed which not only determine that an error has occurred, but also indicate which bit is erroneous. The most well-known error correction code is the Hamming code, which appends a series of check bits to the data word as it is stored. When the data word is read, the retrieved check bits are compared to regenerated check bits calculated for the retrieved data word. The results of the comparison indicate whether an error has occurred, and if so, which bit is erroneous. By inverting the erroneous bit, the error is corrected. In addition, a Hamming code detects two-bit errors which would escape detection under a single-bit parity system. Hamming codes can also be designed to provide for three-bit error detection and two-bit error correction, or any other number of bit errors, by appending more check bits. Thus, Hamming codes commonly provide greater error protection than simple single-bit parity checks.
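To make the check-bit and syndrome mechanics concrete, here is a minimal Hamming(7,4) sketch in C (an illustrative toy code, not the patent's method and not the SEC-DED layout an actual memory controller would use). The encoder computes three check bits over a 4-bit nibble; on read-back, the recomputed syndrome is zero when the word is clean and otherwise names the erroneous bit position, which is then inverted to correct the error.

#include <stdint.h>
#include <stdio.h>

/* Encode a 4-bit nibble into a 7-bit Hamming(7,4) codeword.
 * Bit positions 1..7 hold: p1 p2 d1 p3 d2 d3 d4 (p = check bit, d = data bit). */
static uint8_t hamming74_encode(uint8_t d)
{
    uint8_t d1 = (d >> 0) & 1, d2 = (d >> 1) & 1,
            d3 = (d >> 2) & 1, d4 = (d >> 3) & 1;
    uint8_t p1 = d1 ^ d2 ^ d4;          /* covers positions 1,3,5,7 */
    uint8_t p2 = d1 ^ d3 ^ d4;          /* covers positions 2,3,6,7 */
    uint8_t p3 = d2 ^ d3 ^ d4;          /* covers positions 4,5,6,7 */
    return (uint8_t)(p1 | p2 << 1 | d1 << 2 | p3 << 3 | d2 << 4 | d3 << 5 | d4 << 6);
}

/* Recompute the check bits on read-back; the 3-bit syndrome is 0 when no
 * error occurred and otherwise gives the erroneous bit position (1..7). */
static uint8_t hamming74_syndrome(uint8_t cw)
{
    uint8_t b[8] = {0};
    for (int i = 1; i <= 7; i++)
        b[i] = (cw >> (i - 1)) & 1;
    uint8_t s1 = b[1] ^ b[3] ^ b[5] ^ b[7];
    uint8_t s2 = b[2] ^ b[3] ^ b[6] ^ b[7];
    uint8_t s3 = b[4] ^ b[5] ^ b[6] ^ b[7];
    return (uint8_t)(s1 | s2 << 1 | s3 << 2);
}

int main(void)
{
    uint8_t cw  = hamming74_encode(0xB);    /* store nibble 1011                  */
    uint8_t bad = cw ^ (1u << 4);           /* flip bit position 5 (data bit d2)  */
    uint8_t syn = hamming74_syndrome(bad);
    printf("syndrome = %u\n", syn);         /* prints 5: the erroneous position   */
    bad ^= (uint8_t)(1u << (syn - 1));      /* invert that bit to correct it      */
    printf("corrected codeword matches: %d\n", bad == cw);
    return 0;
}

The same construction scales to wider words by appending more check bits, which leads to the overhead issue discussed next.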
Unfortunately, Hamming codes require several check bits to accomplish the error detection and correction. For example, an eight-bit data word requires five check bits to detect two-bit errors and correct one-bit errors. As the bus grows wider and the number of bits of transmitted data increases, the number of check bits required also increases. Because modern memory buses are often 64 or 128 bits wide, the associated Hamming code would be very long indeed, requiring considerable memory space just for the check bits. However, for wider data words the ratio of check bits to data bits decreases, so the relative check-bit overhead is smaller than it is for narrower data words.
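The check-bit counts quoted above follow from the standard Hamming bound: k check bits suffice for single-error correction of a d-bit word when 2^k >= d + k + 1, and one additional overall parity bit provides double-error detection. The following short C sketch (illustrative only) tabulates this for common word widths.

#include <stdio.h>

/* Minimum number of Hamming check bits k for a d-bit word satisfies
 * 2^k >= d + k + 1 (single-error correction); one extra overall parity
 * bit gives double-error detection (SEC-DED). */
static int secded_check_bits(int d)
{
    int k = 0;
    while ((1 << k) < d + k + 1)
        k++;
    return k + 1;                       /* + overall parity bit */
}

int main(void)
{
    int widths[] = {8, 16, 32, 64, 128};
    for (int i = 0; i < 5; i++)
        printf("%3d data bits -> %d check bits\n",
               widths[i], secded_check_bits(widths[i]));
    /* prints: 8 -> 5, 16 -> 6, 32 -> 7, 64 -> 8, 128 -> 9 */
    return 0;
}

This reproduces the figures in the text: five check bits for an eight-bit word, and eight or nine check bits for 64-bit or 128-bit words, respectively.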
A further problem is caused by modern random access memory (RAM) chips. In early memory systems, RAM chips were organized so that each chip provided one bit of data for each address. Current RAM chips, however, are frequently organized into sets of four bits of data for each address. If one of these RAM chips fails, the result is one to four potentially erroneous data bits. Unless the error correction code is designed for four-bit error detection, a four-bit error may go completely undetected. Incorporating a four-bit error detection (four-bit package error detection) and one-bit correction code in a 64-bit or 128-bit memory system, however, would require eight or nine check bits. Currently, a large percentage of memory array production is used in personal computers (PCs), and it is anticipated that in the years through 2004, PCs will typically employ 32 MB to 256 MB memory systems.
Presently, memory arrays typically contain 256 Mbit devices and the trend is towards production of memory arrays that will contain 1 Gbit within two to four years. With the anticipated increase in memory array sizes, the present approach of utilizing 1 or 4 bit wide memory chip organization must be reconsidered. For example, employing the present 1 or 4 bit memory chip organization with a 32 bit wide data word will require 32 memory arrays (1 bit organization) or 8 memory arrays (4 bit organization). This will, in turn, result in a minimum granularity in a PC of 8 GB or 4 GB, respectively. This large amount of memory in a desktop or laptop computer is excessive and unnecessary and also has the added disadvantage of increasing the overall cost of the computer system. In response to the minimum granularity problem, memory array manufacturers are moving to 8, 16 and even 32 bit wide memory organization schemes with the corresponding increase in the number of check bits required for error detection and correction.
Accordingly, what is needed in the art is an improved error detection technique that mitigates the above-described limitations of the prior art. More particularly, what is needed in the art is an improved method for detecting hard failures in a memory array that utilizes memory organization schemes wider than 4 bits.
SUMMARY OF THE INVENTION
It is therefore an object of the invention to provide an improved memory system.
It is another object of the invention to provide …
Kilmer Charles Arthur
Singh Shanker
Bluestone Randall J.
Chaudry Mujtaba
Decady Albert