Error detection/correction and fault detection/recovery – Pulse or data error handling – Digital data error correction
Reexamination Certificate
1999-10-28
2002-12-10
Decady, Albert (Department: 2133)
C714S774000
active
06493843
ABSTRACT:
CROSS REFERENCE TO RELATED APPLICATION
The applications entitled “Self-Healing Memory System for High Availability Server”, identified by HP Docket Number 10991629 and the inventors Michael B. Raynham and James G. Mathios, filed Oct. 28, 1999, and “Radial Arm Memory Bus for a High Availability Computer System”, identified by HP Docket Number 10991678 and the inventors Michael B. Raynham and Hans Wiggers, filed Oct. 28, 1999, include subject matter related to the copending application.
BACKGROUND OF THE INVENTION
Available memory systems are constantly expanding in size with time, with current server memory systems often being in the range of up to 64 Gbytes (approximately half a trillion storage bits) or larger for high end servers. Depending on the customer requirements, the customer may choose to purchase a low end or high end server. Current low end server systems typically include one to four dual in line memory modules (DIMMs) while high end servers typically include four or more DIMMs. Typically, high end servers also include high availability features such as memory chip redundancy, hot swapping, and the ability to do chipkill error correction.
FIG. 1A shows a side view of a memory system for a low end server system 100. The memory system includes a CPU or memory controller 102 affixed to a motherboard 106 and two dual in line memory modules 108a and 108b. The two memory modules 108a-b shown each include N memory devices 112 connected in parallel. Assuming for purposes of discussion that N is equal to eighteen, the eighteen memory devices 112a-N on each memory module 108a and 108b are connected to the memory controller 102 by a data bus 114, which includes board trace portions 116, connectors 118 and module trace portions 120.
FIG. 1B shows a block diagram of the memory structure of the memory modules of the low end server shown in FIG. 1A. In the embodiment shown in FIG. 1B, the data bus is 72 bits wide where 64 bits are used for data and 8 bits are used for error correction. Each of the eighteen memory devices on the memory module 108a-b is 4 bits wide and for a 256 Mbyte system each of the eighteen ×4 SDRAMs is 32 Mbyte. The eighteen ×4 memory devices are connected in parallel so that for each memory operation, the output onto the data bus 114 is 72 bits wide.
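The bus-width arithmetic for FIG. 1B can be checked with a short sketch. The constant and function names below are hypothetical and chosen only for illustration; the values come from the text. Expressing the 64 data bits and 8 error correction bits as counts of ×4 devices is an equivalence drawn from those numbers, not a statement about which physical devices hold which bits.

```python
# Illustrative check of the FIG. 1B bus-width arithmetic.
# Names are hypothetical; the numeric values come from the text.
DEVICES_PER_MODULE = 18   # N = 18 memory devices per DIMM
DEVICE_WIDTH_BITS = 4     # each device is a x4 part
DATA_BITS = 64            # bits of the bus used for data
ECC_BITS = 8              # bits of the bus used for error correction


def bus_width(devices: int, device_width: int) -> int:
    """Width of the word produced when all devices drive the bus in parallel."""
    return devices * device_width


width = bus_width(DEVICES_PER_MODULE, DEVICE_WIDTH_BITS)

# Number of x4 devices' worth of bits devoted to data and to error correction.
data_device_equivalents = DATA_BITS // DEVICE_WIDTH_BITS
ecc_device_equivalents = ECC_BITS // DEVICE_WIDTH_BITS

print(width, data_device_equivalents, ecc_device_equivalents)
```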
FIG. 1C shows a clock pulse for reading or writing to a memory location of the low end server shown in FIG. 1A. The memory controller reads a single word or memory location from a single memory module at a time. Assuming a single data rate (SDR) system and a read operation, the memory location in memory module 108 having the address 000000 is read at the clock edge t1. The contents of the memory location are 72 bits wide. No memory operation occurs at clock edge t2. A second memory location having the address location 000001 in memory module 108 is read at the clock edge t3.
FIG. 2A shows a side view of a memory system for a conventional high end server system 200 having eight DIMM modules. Similar to the low end server configuration shown in FIG. 1A, the memory system shown in FIG. 2A includes a CPU or memory controller 202 affixed to a motherboard 206. However, the high end configuration includes eight dual in line memory modules 208a-h instead of the two DIMMs 108a-b shown in FIG. 1A. The eight memory modules 208a-h shown each include N memory devices 212. The memory controller 202 is connected to the eight memory modules 208a-h by a data bus 214, which includes board trace portions 216, connectors 218 and module trace portions 220.
FIG. 2B shows a block diagram of the memory structure of the high end server shown in FIG. 2A. In the high end server shown, the data bus is 144 bits wide where 128 bits are used for data and 16 bits are used for error correction. Preferably each memory module includes eighteen memory devices (N=18), each memory device being 4 bits wide. For each memory module, the eighteen ×4 memory devices are connected in parallel. Data is read from two memory modules simultaneously, so that for each memory operation, the output onto the data bus 214 is 144 bits wide.
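As a rough illustration of how two 72 bit module outputs make up one 144 bit bus word, the following sketch concatenates two module words. The function name and the choice of which module occupies the upper half of the word are assumptions made for the example; the text does not specify a bit ordering.

```python
# Sketch: forming a 144-bit bus word from two 72-bit module words.
# The bit ordering (first module in the upper half) is an assumption.
MODULE_WIDTH = 72  # 18 devices x 4 bits, per the text


def combine_module_words(upper_word: int, lower_word: int) -> int:
    """Concatenate two 72-bit module words into one 144-bit bus word."""
    limit = 1 << MODULE_WIDTH
    if not (0 <= upper_word < limit and 0 <= lower_word < limit):
        raise ValueError("each module word must fit in 72 bits")
    return (upper_word << MODULE_WIDTH) | lower_word


# Both modules driving all ones fills the entire 144-bit bus.
all_ones = (1 << MODULE_WIDTH) - 1
bus_word = combine_module_words(all_ones, all_ones)
print(bus_word.bit_length())
```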
FIG. 2C shows a clock pulse for reading or writing to a memory location of the high end server shown in FIG. 2A. The memory controller reads a single word or memory location from a single memory module at a time. Assuming a double data rate (DDR) system and a memory read operation, the memory location in memory module 208a having the address 000000 and the memory location in memory module 208e having the address 000000 are both read simultaneously at the clock edge t1. A second memory location in memory module 208a having the address location 000001 and a second memory location having the address 000001 in memory module 208d are both read simultaneously at the clock edge t2.
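The difference between the read timing of FIG. 1C and FIG. 2C can be summarized in a small sketch. It assumes the pattern described for the first two reads simply continues: an SDR system reads on every other clock edge (t1, t3, ...), while a DDR system reads on every edge (t1, t2, ...). The function name is hypothetical.

```python
# Sketch of the clock edges on which successive reads occur.
# Assumes the pattern described for the first reads continues unchanged.
def read_edges(num_reads: int, double_data_rate: bool) -> list:
    """Return the clock edge numbers (1 means t1) used by successive reads."""
    step = 1 if double_data_rate else 2  # SDR skips every other edge
    return [1 + i * step for i in range(num_reads)]


sdr_edges = read_edges(2, double_data_rate=False)  # low end server, FIG. 1C
ddr_edges = read_edges(2, double_data_rate=True)   # high end server, FIG. 2C
print(sdr_edges, ddr_edges)
```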
The current trend of increasing memory size is likely to continue. Microprocessor suppliers continue to supply higher speed CPUs. With increases in CPU speed come increased speed in the CPU bus and supporting I/O systems and a corresponding increase in server memory size per CPU since more users per CPU can be supported. As the size of memory systems increases, the probability of a memory bit failing, and thus the memory system failing, increases. Customers are demanding improved error correction features to deal with these increases in memory failures even for low end systems.
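The relationship between memory size and failure probability can be made concrete with a simple model. The sketch below assumes each bit fails independently with the same probability and that no error correction is applied, so a single failed bit counts as a system failure. The per-bit probability used here is a made-up placeholder and not a measured figure.

```python
import math

# Sketch: probability that at least one of n bits fails, assuming
# independent failures with identical per-bit probability p (an assumption).
def system_failure_probability(num_bits: int, per_bit_probability: float) -> float:
    """Compute 1 - (1 - p)**n in a numerically stable way."""
    return -math.expm1(num_bits * math.log1p(-per_bit_probability))


HYPOTHETICAL_P = 1e-15          # placeholder value, not real data
BITS_256_MBYTE = 256 * 2**20 * 8
BITS_64_GBYTE = 64 * 2**30 * 8  # about half a trillion bits

small = system_failure_probability(BITS_256_MBYTE, HYPOTHETICAL_P)
large = system_failure_probability(BITS_64_GBYTE, HYPOTHETICAL_P)
print(small < large)
```

Under this model the failure probability grows with the number of bits for any fixed per-bit probability, which is the trend the paragraph describes.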
One error correction feature that was traditionally not supported in low end servers is what is known in the industry as chipkill. The term chipkill traditionally refers to the ability to correct multiple bit errors in memory, where the multiple bit error is the width of the memory device. For example, for a 32 Mbit SDRAM that is 4 bits wide, a system that supports the chipkill function would be able to correct a 4 bit wide error in the memory device. Thus, the failure of an entire SDRAM chip organized in a ×4 configuration in a system that supports chipkill would not cause the system to fail.
Chipkill is provided in high end chipsets, for example, by combining two DIMMs into a 144 bit bus that includes 128 data bits and 16 ECC bits, where ECC stands for error correcting or error checking and correcting codes. The number of bits that can be corrected typically depends on the number of ECC bits supported by the system. ECC or error correction code refers to a commonly used error detection and correction process that is typically based on a CRC (cyclic redundancy code) algorithm. CRC algorithms work so that when data is received, the complete data sequence (which includes CRC bits appended to the end of the data field) is read by a CRC checker. The complete data sequence should be exactly divisible by a CRC polynomial. If the complete data sequence is not divisible by the CRC polynomial, an error is deemed to have occurred.
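The divisibility check described above can be sketched as follows. This is a minimal illustration of CRC error detection only; it does not perform correction and is not the code used by any particular chipset. The generator polynomial x^3 + x + 1 and the sample data value are arbitrary choices for the example, and the function names are hypothetical.

```python
# Sketch of a CRC check using modulo-2 (XOR) polynomial division.
# The polynomial and data below are illustrative, not from the source.
def mod2_remainder(value: int, poly: int) -> int:
    """Remainder of value divided by poly using carry-less arithmetic."""
    poly_len = poly.bit_length()
    while value.bit_length() >= poly_len:
        # Align the polynomial with the highest set bit and cancel it.
        value ^= poly << (value.bit_length() - poly_len)
    return value


def append_crc(data: int, poly: int) -> int:
    """Append CRC bits so the complete sequence is divisible by poly."""
    crc_bits = poly.bit_length() - 1
    shifted = data << crc_bits
    return shifted | mod2_remainder(shifted, poly)


def crc_ok(sequence: int, poly: int) -> bool:
    """The CRC checker: a nonzero remainder means an error is deemed to have occurred."""
    return mod2_remainder(sequence, poly) == 0


POLY = 0b1011                       # x^3 + x + 1, chosen for illustration
sequence = append_crc(0b11010011101100, POLY)
corrupted = sequence ^ (1 << 5)     # flip one bit to simulate a fault
print(crc_ok(sequence, POLY), crc_ok(corrupted, POLY))
```

The intact sequence divides evenly and passes the check, while the sequence with a flipped bit leaves a nonzero remainder and is flagged.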
Supporting the chipkill function based on an ECC process typically requires additional error correction bits where the number of bits corrected depends on the number of ECC bits supported by the system. For example, typically the CRC algorithm used to correct for a 4 bit wide memory organization requires more than the eight error correction bits that are provided by the low end server shown in FIG. 1A. Thus, to perform the chipkill function for a ×4 organization, the low end server would require additional memory modules or devices to provide the additional ECC bits necessary to perform the required CRC algorithms. However, because the CRC algorithm typically used to correct for a 4 bit wide memory organization does not require more than the 16 error correction bits supported by the high end server shown in FIG. 2A, the high end server could support the chipkill function.
An alternative implementation available to low end server systems that wish to provide chip kill error correction is to provide a custom ASIC, such as that currently made commercially available from IBM Cor
De'cady Albert
Torres Joseph D.