Memory error correction using redundant sliced memory and...

Error detection/correction and fault detection/recovery – Pulse or data error handling – Digital data error correction

Reexamination Certificate

active

06397365

ABSTRACT:

BACKGROUND
Dynamic Random Access Memory (DRAM) is used extensively in a variety of applications, especially in conjunction with digital microprocessors. In a typical configuration, several Central Processing Units (CPUs) will be interfaced with a Processor and Memory Address device (PMA), as shown in FIG. 1. The PMA is interfaced with one or more Processor and Memory Data devices (PMD). Each PMD is interfaced with a plurality of Memory Modules (MM). The PMA functions to arbitrate the addresses received from each CPU, and directs each address to the correct PMD. The PMD receives the address and determines where within the MMs to read or write data. Each MM corresponds to a slice of data and is comprised of DRAM. The PMD also performs error correction operations.
The number of DRAM chips required to provide the needed memory capacity in a multi-processor system is large. The probability of a DRAM failing compared to the other components in the system is high. DRAMs can have single or multi-bit errors for a variety of reasons. Random single bit errors can often be caused by radiation bombardment. Cross talk on lines connected to the DRAM may also cause errors. Further, an entire DRAM device may fail. It is therefore desirable to provide some redundant memory, coupled with error detection and correction logic, to minimize the adverse effect of the occurrence of errors. Preferably, an error detection and correction scheme minimizes the amount of redundant memory required while minimizing the computational overhead required for detection and correction. Typically, an error correction scheme is employed which reduces the probability of uncorrected errors to some acceptable level.
The classical approach to detection and correction of errors is by use of an error correction code (ECC). An error correction code associated with a slice of data is stored and utilized to determine if an error has occurred in the slice and to then correct the erroneous bit. Typical ECCs provide guaranteed single bit error correction and double-bit error detection. Additionally, many multi-bit errors can be detected. The weakness of these codes is that some multi-bit errors will appear to be single-bit errors and some multi-bit errors will not be detected at all (a no-error syndrome). More elaborate codes have been created which provide better detection and correction capability. These codes further reduce the possibility of data corruption at the expense of greater computational overhead.
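The behaviour described above can be illustrated with a toy code. The following is a minimal sketch, assuming an extended Hamming (8,4) code rather than the wider codes used on real memory lines; the function names `encode`, `decode`, and `flip` are hypothetical and do not come from the patent. It shows single-bit correction, double-bit detection, and a triple-bit error that is mistaken for a single-bit error.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword.

    Index 0 holds the overall parity bit; indices 1..7 are the usual
    Hamming positions (parity at 1, 2, 4 and data at 3, 5, 6, 7).
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def decode(code):
    """Return (status, nibble); nibble is None when the word is uncorrectable."""
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos          # XOR of the positions of all set bits
    overall = sum(code) % 2          # 0 when the overall parity is consistent
    code = list(code)
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # An odd number of flipped bits is assumed to be exactly one flip.
        # A syndrome of 0 here points at the overall parity bit itself.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        return "double-bit error detected", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


def flip(code, positions):
    """Return a copy of the codeword with the given bit positions inverted."""
    out = list(code)
    for p in positions:
        out[p] ^= 1
    return out


word = encode(0b1011)
one_bit = decode(flip(word, [5]))         # corrected back to 0b1011
two_bit = decode(flip(word, [2, 6]))      # detected, not corrected
three_bit = decode(flip(word, [1, 2, 7])) # reported as "corrected", wrong data
```

In this sketch the three-bit case returns the status "corrected" together with a data value that differs from the one that was stored, which is the silent miscorrection the paragraph above warns about.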
Another solution targeted at an entire DRAM chip failure (either as a transient failure, or a permanent failure) is achieved by distributing the ECC across numerous DRAM chips so that no two bits covered by a single ECC domain are from a single DRAM chip. Thus, if the ECC code covers 64 bits of data, then all 64 bits of data are from different DRAMs. In this approach, a block of data is written to a DRAM in the memory system. Each bit of the block belongs to a different ECC domain and only one bit of each ECC may be stored on the DRAM. This approach works well in solving the problem of a single DRAM failure, but has some weaknesses. First, once a DRAM fails, any future problem (single bit or multi-bit errors) will cause the data to be non-correctable. This implies that field service personnel must quickly replace the failing DRAM component to ensure guaranteed levels of system availability. The second weakness of this approach is that since each bit of a DRAM memory line must belong to a different ECC domain, a large number of DRAMs must be addressed for error detection when a line of data is read. This results in significantly increased power consumption.
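The bit-to-chip mapping described above can be sketched as follows. This is an illustrative model only: the sizes `NUM_CHIPS` and `BITS_PER_CHIP` and the helper names are assumptions chosen to keep the example small (the text uses 64 bits per ECC domain as its example).

```python
NUM_CHIPS = 8       # bits per ECC domain, one bit per chip (assumed size)
BITS_PER_CHIP = 4   # bits each chip supplies per access = number of ECC domains


def location(domain, bit):
    """Map bit `bit` of ECC domain `domain` to (chip, bit within chip)."""
    return bit, domain


def bits_lost_per_domain(failed_chip):
    """For each ECC domain, list the bit indices lost when one chip fails."""
    lost = {d: [] for d in range(BITS_PER_CHIP)}
    for d in range(BITS_PER_CHIP):
        for b in range(NUM_CHIPS):
            chip, _ = location(d, b)
            if chip == failed_chip:
                lost[d].append(b)
    return lost


def chips_touched(domain):
    """Set of chips that must be accessed to check one ECC domain."""
    return {location(domain, b)[0] for b in range(NUM_CHIPS)}


lost = bits_lost_per_domain(failed_chip=3)
touched = chips_touched(domain=0)
```

Under this mapping a whole-chip failure costs each ECC domain exactly one bit, which a single-bit-correcting code can repair, while checking any one domain touches every chip, which is the source of the power cost noted above.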
An alternative approach to error correction has been adapted from techniques used to solve disk errors. This approach is referred to as the RAID technique when applied to disks (Redundant Array of Independent Disks) and as checksum techniques when applied to memory. Checksum mechanisms employ a redundant DRAM and a checksum for data reconstruction when an error is detected. The checksum is obtained by forming the exclusive-or (XOR) operation between the data stored in a set of N DRAM blocks or MMs. The resultant checksum is then stored in a redundant MM or DRAM block, which has a capacity at least equal to the capacity of the other N DRAM blocks or MMs. More specifically, the data at each address, x, of each of the N MMs (or DRAM blocks) of data are XOR-ed to form a checksum that is stored in a corresponding address, x, of the redundant MM (or DRAM block). If a MM or DRAM block that contains data fails, then the data that was stored therein may be reconstructed by XORing the remaining DRAMS together with the checksum stored in the redundant DRAM block. This backup operation is typically performed by the PMD.
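The checksum scheme just described can be sketched in a few lines. This is a minimal illustration, assuming each MM is modelled as a list of integer words indexed by address x; the names `build_checksum` and `reconstruct` are hypothetical.

```python
from functools import reduce
from operator import xor


def build_checksum(modules):
    """XOR the words at each address x across the N data modules."""
    return [reduce(xor, words) for words in zip(*modules)]


def reconstruct(modules, checksum, failed):
    """Rebuild the contents of the failed module.

    Starts from the checksum module and XORs in every surviving data
    module, address by address.
    """
    rebuilt = list(checksum)
    for i, module in enumerate(modules):
        if i == failed:
            continue
        for x, word in enumerate(module):
            rebuilt[x] ^= word
    return rebuilt


# Three data modules (N = 3), each holding three words.
modules = [
    [0x12, 0x34, 0x56],
    [0xAB, 0xCD, 0xEF],
    [0x0F, 0xF0, 0x55],
]
checksum = build_checksum(modules)
rebuilt = reconstruct(modules, checksum, failed=1)
```

Here `rebuilt` equals the original contents of module 1, even though that module was never read during reconstruction.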
One prior art approach stores an entire memory line into each memory module. A disadvantage of this approach is that if a DRAM block fails, the entire process must be halted until the data of the failed DRAM block is reconstructed. Another disadvantage of this method is that in order to provide uniform access across all memory modules in the system, the DRAM used to store the checksum must be rotated among all of the DRAM blocks. This results in considerable additional complexity and computational overhead. It is also noted that the full bandwidth required for cache access is demanded of each DRAM block in this prior art approach.
Therefore, it is desirable to devise apparatus and methods for reconstructing lost data in real time without having to stop a process for reconstruction of lost data, and without having to rotate the checksum storage among different modules to achieve uniform bandwidth access.
SUMMARY OF THE INVENTION
An object of the present invention is therefore to provide methods and apparatus for reliable memory which do not require halting an application in operation to reconstruct lost data. Another object of the present invention is to provide uniform access of all memory modules in the memory system without increased complexity and computational overhead.
There are multiple approaches through which a line of memory can be stored into memory modules upon which checksum operations can be performed. A prior art approach is described in U.S. Pat. No. 4,849,978, which is incorporated herein by reference, which approach stores an entire memory line into each memory module. A disadvantage of this approach is that if a DRAM block fails, the entire process must be halted until the data of the failed DRAM block is reconstructed. Another disadvantage of this method is that in order to provide uniform access across all memory modules in the system, the DRAM used to store the checksum must be rotated among all of the DRAM blocks. This results in considerable additional complexity and computational overhead. It is also noted that the full bandwidth required for cache access is demanded of each DRAM block in this prior art approach.
The following inventive approach describes a system which need not be halted for reconstruction of data in a DRAM and in which the DRAM used to store the checksum need not be rotated among all of the DRAM blocks.
The inventive approach is to store a slice of a memory line in each of N memory modules. According to one aspect of the present invention, a redundant memory slice is provided in addition to N data slices, where N is an integer. Each slice of memory may be implemented by separate DRAM chips. The redundant slice stores a checksum which may be used to reconstruct the data of any one of the N slices. The checksum is formed by XORing the N data slices together in a bitwise fashion. Thus, bit zero of each of the N data slices is XORed together to produce bit zero of the redundant slice. Similarly, bit n of the redundant slice is created by XORing bit n of each of the N data slices. The XOR logical operator has the property that by XORing the checksum stored in the redundant slice with the data in N−1 of the data slices, the result will be the data that was stored in the remaining Nth data slice.
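The sliced arrangement can be sketched as follows. This is only an illustration under stated assumptions: the memory line is modelled as a byte string, it is cut into N contiguous equal-width slices (the text does not say how bits are assigned to slices), and the helper names are hypothetical.

```python
def split_line(line, n):
    """Cut a memory line into n equal-width slices (contiguous bytes assumed)."""
    if len(line) % n:
        raise ValueError("line length must be a multiple of n")
    width = len(line) // n
    return [line[i * width:(i + 1) * width] for i in range(n)]


def xor_bytes(a, b):
    """Bitwise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def redundant_slice(slices):
    """Checksum slice: bit k is the XOR of bit k of every data slice."""
    out = bytes(len(slices[0]))
    for s in slices:
        out = xor_bytes(out, s)
    return out


def read_line(slices, redundant, failed=None):
    """Reassemble the line, rebuilding one failed slice during the read."""
    slices = list(slices)
    if failed is not None:
        rebuilt = redundant
        for i, s in enumerate(slices):
            if i != failed:
                rebuilt = xor_bytes(rebuilt, s)
        slices[failed] = rebuilt
    return b"".join(slices)


line = bytes(range(32))              # a 32-byte line, N = 4 slices
slices = split_line(line, 4)
redundant = redundant_slice(slices)

damaged = list(slices)
damaged[2] = bytes(8)                # slice 2 is lost (contents unusable)
recovered = read_line(damaged, redundant, failed=2)
```

Because every read already gathers all N slices of a line, the missing slice can be rebuilt as part of the same read from the N−1 good slices and the redundant slice, without a separate recovery pass over the failed module.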
According to another aspect of th
