Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability
Reexamination Certificate
1999-10-07
2002-10-15
Le, Dieu-Minh (Department: 2184)
C714S054000
active
06467048
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to computer systems, and more particularly, to high availability computer systems having error detecting and correcting random access memory, at least one cache memory and a fail-over replacement system for defective portions of the random access memory.
2. Description of the Related Technology
Use of computers, especially personal computers, is becoming more and more pervasive because the computer has become an integral tool of most information workers who work in the fields of accounting, law, engineering, insurance, services, sales and the like. Rapid technological improvements in the field of computers have opened up many new applications heretofore unavailable or too expensive for the use of older technology mainframe computers. These personal computers may be used as stand-alone workstations (high end individual personal computers) or linked together in a network by a “network server” which is also a personal computer which may have a few additional features specific to its purpose in the network.
The network server may be used to store massive amounts of data, and may facilitate interaction of the individual workstations connected to the network for electronic mail (“e-mail”), document databases, video teleconferencing, whiteboarding, integrated enterprise calendar, virtual engineering design and the like. Multiple network servers may also be interconnected by local area networks (“LAN”) and wide area networks (“WAN”).
As users become more and more dependent on computers, the requirement that the computer system remain operational when most needed is of paramount importance. An unplanned service outage because of a computer server crash may leave customers waiting in line at checkout counters, doctors unable to obtain patient data, and on-line users unable to log onto a network, or may cause an office slowdown or even shutdown because documents, e-mail, accounting information, Internet web page hosting, etc., become inaccessible.
The network servers are being widely used in mission critical business, scientific and government applications by, for example, tying together the personal computer workstations into a network (LAN and WAN), and for storing and/or forwarding critical information. Software applications such as databases that run on these servers are becoming more memory intensive than ever before. The memory systems of these servers are continually becoming larger in order to handle the more demanding software application programs and files associated therewith. At the same time, rapidly advancing electronics technologies enable microprocessors and associated memory devices to run at ever faster clock speeds using lower voltages. The lower voltage creates a lower data signal noise margin, and the higher clock speeds exacerbate noise conditions. As a direct result, the computer system environmental noise becomes a more significant factor and data is more vulnerable to errors caused by transient electrical and electromagnetic phenomena that can corrupt the data stored in the memory subsystem.
When a memory error does occur, a server should not lose critical data or crash. A server may employ an error checking and correcting (ECC) logic circuit to improve data integrity and thus data availability by detecting and correcting “soft” data errors within the memory subsystem. Error detection and correction allows the server memory subsystem to operate continuously, and to be available as long as the detected errors are correctable by the ECC logic circuit. However, a memory address of the memory subsystem that has experienced excessive ECC soft errors is more likely to continue generating errors and the severity of these errors may increase to the point where the ECC logic circuit can no longer correct all of the errors. At the point of being unable to correct all of the errors, the server may crash.
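The detect, correct and count behavior described above can be illustrated with a minimal sketch. It assumes a Hamming(7,4) single-error-correcting code purely for brevity; the text does not say which code the ECC logic circuit uses, and server memory typically protects much wider words. The names `EccMemory`, `SOFT_ERROR_THRESHOLD` and `suspect_addresses` are hypothetical.

```python
from collections import Counter

SOFT_ERROR_THRESHOLD = 3  # hypothetical limit for "excessive" soft errors


def hamming74_encode(nibble):
    """Encode 4 data bits (0..15) as a 7-bit codeword, positions 1..7."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8  # index 0 unused so indices match bit positions
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]  # parity over positions 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]  # parity over positions 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]  # parity over positions 4,5,6,7
    return code[1:]


def hamming74_decode(bits):
    """Return (nibble, syndrome); a nonzero syndrome is the corrected position."""
    code = [0] + list(bits)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    if syndrome:
        code[syndrome] ^= 1  # flip the single faulty bit back
    data = [code[3], code[5], code[6], code[7]]
    return sum(bit << i for i, bit in enumerate(data)), syndrome


class EccMemory:
    """Toy memory that corrects single-bit errors and counts them per address."""

    def __init__(self):
        self.cells = {}
        self.soft_errors = Counter()

    def write(self, addr, nibble):
        self.cells[addr] = hamming74_encode(nibble)

    def read(self, addr):
        nibble, syndrome = hamming74_decode(self.cells[addr])
        if syndrome:
            self.soft_errors[addr] += 1
            self.cells[addr] = hamming74_encode(nibble)  # write back clean data
        return nibble

    def suspect_addresses(self):
        """Addresses whose corrected-error count reached the threshold."""
        return [a for a, n in self.soft_errors.items()
                if n >= SOFT_ERROR_THRESHOLD]
```

An address returned by `suspect_addresses` corresponds to the case the text warns about: it still reads correctly, but it is a candidate for replacement before its errors become uncorrectable.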
Defective portions of a memory module(s) (i.e., portions having an excessive number of errors) have been replaced or bypassed in the memory subsystem by marking the section (for example, 128 KB) of faulty memory as defective (due to excessive or non-correctable errors). The server would then need to be shut down and restarted without the section of faulty memory mapped into the computer system address space. Thus, a network outage is required and a subsequent reduction in system memory capacity results.
Fully redundant memory subsystems may be used, and when excessive errors occur in one, it is marked as defective. When the server is restarted, the defective memory is not mapped as usable memory; however, half of the system memory is no longer functional and system performance suffers.
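As a rough illustration of mapping out marked sections at restart, the sketch below computes the byte ranges that would remain in the address space. The 128 KB section size is the example from the text; the function name `usable_ranges` and the representation of marked sections as a set of indices are assumptions.

```python
SECTION_SIZE = 128 * 1024  # 128 KB, the example granularity from the text


def usable_ranges(total_bytes, bad_sections):
    """Return half-open (start, end) byte ranges that skip marked sections.

    bad_sections is a set of section indices flagged as faulty (hypothetical
    representation; the text does not describe how the marks are stored).
    """
    section_count = total_bytes // SECTION_SIZE
    ranges = []
    start = None
    for index in range(section_count):
        base = index * SECTION_SIZE
        if index in bad_sections:
            if start is not None:
                ranges.append((start, base))  # close the run before the bad section
                start = None
        elif start is None:
            start = base  # open a new run of good sections
    if start is not None:
        ranges.append((start, section_count * SECTION_SIZE))
    return ranges
```

For a 1 MB memory with section 2 marked, the result is two ranges whose combined size is 128 KB less than the original, which is the capacity reduction the text describes.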
A standby hot fail-over memory system allows the memory controller to fail-over to a standby memory module the data stored in the memory module having errors before an uncorrectable error happens. This fail-over system, however, allows only one memory module to be replaced. It cannot solve the problem of errors coming from multiple memory modules. An additional memory module must be designated as the standby memory module. It takes a longer time for the fail-over process to complete since all of the data stored in the failing memory module must be transferred to the standby memory module. The fail-over time is dependent upon the memory size and the actual memory traffic that is generated during the fail-over process. The standby hot fail-over memory system is more fully described in commonly owned U.S. patent application Ser. No. 08/763,411, filed Dec. 11, 1996, entitled “Failover Memory for a Computer System” by Sompong P. Olarig, and is incorporated by reference herein.
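The dependence of fail-over time on memory size and memory traffic can be put into a back-of-the-envelope model. This is only an assumed first-order estimate, not a formula from the text or the referenced application: it supposes that the copy gets whatever fraction of memory bandwidth normal traffic leaves free. The function name and all parameters are hypothetical.

```python
def failover_seconds(module_bytes, bandwidth_bytes_per_s, traffic_share):
    """Estimate the time to copy a whole module to the standby module.

    traffic_share is the fraction of memory bandwidth consumed by normal
    traffic during the copy (assumed constant for simplicity).
    """
    if not 0 <= traffic_share < 1:
        raise ValueError("traffic_share must be in [0, 1)")
    return module_bytes / (bandwidth_bytes_per_s * (1 - traffic_share))
```

Under this model, doubling the module size doubles the fail-over time, and heavier traffic during the copy stretches it further.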
A fast fail-over memory allows the memory controller to support multiple memory address failures while the computer system is running before an uncorrectable error occurs. The fast fail-over memory system requires a portion of additional standby memory space to function. If there are no memory errors, the fail-over standby memory is not being used. The fast fail-over memory system is more fully described in commonly owned U.S. patent application Ser. No. 09/116,714, filed Jul. 16, 1998, entitled “Fail-Over of Multiple Memory Blocks in Multiple Memory Modules in a Computer System” by Sompong P. Olarig, and is incorporated by reference herein.
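One way to picture fail-over of multiple memory addresses is a remap table that redirects failing blocks into a reserved standby region. The sketch below is an illustrative assumption, not the mechanism of the referenced application; the class `FastFailover`, its block granularity and its standby pool are hypothetical, and copying the block's data is omitted.

```python
class FastFailover:
    """Toy remap table redirecting failing blocks to standby blocks."""

    def __init__(self, block_size, standby_blocks):
        self.block_size = block_size
        self.free = list(standby_blocks)  # unused standby block numbers
        self.remap = {}                   # failing block -> standby block

    def fail_over(self, addr):
        """Redirect the block containing addr; returns the standby block."""
        block = addr // self.block_size
        if block in self.remap:
            return self.remap[block]      # already failed over
        if not self.free:
            raise RuntimeError("standby memory exhausted")
        # A real controller would also copy the block contents here.
        self.remap[block] = self.free.pop(0)
        return self.remap[block]

    def translate(self, addr):
        """Map an address to where its data now lives."""
        block, offset = divmod(addr, self.block_size)
        return self.remap.get(block, block) * self.block_size + offset
```

While `remap` is empty, `translate` returns every address unchanged and the standby blocks sit idle, matching the observation that the standby memory is not used when there are no memory errors.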
The processor or plurality of processors in a computer system run in conjunction with a high capacity, low-speed (relative to the processor speed) main memory, and a low capacity, high-speed (comparable to the processor speed) cache memory or memories (one or more cache memories associated with each of the plurality of processors).
Cache memory is used to reduce memory access time in mainframe computers, minicomputers, and microprocessors. The cache memory provides a relatively high speed memory interposed between the slower main memory and the processor to improve effective memory access rates, thus improving the overall performance and processing speed of the computer system by decreasing the apparent amount of time required to fetch information from main memory.
In today's single and multi-processor computer systems, there is typically at least one level of cache memory for each of the processors. The latest microprocessor integrated circuits may have a first level cache memory located in the integrated circuit package and closely coupled with the central processing unit (“CPU”) of the microprocessor. Additional levels of cache may also be implemented by adding fast static random access memory (SRAM) integrated circuits and a cache controller. Typical secondary cache size may be anywhere from 64 kilobytes to 8 megabytes and the cache SRAM has an access time comparable with the processor clock speed.
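The reduction in apparent fetch time can be made concrete with a simple weighted-average model. This is a common textbook simplification rather than anything stated in the text: it assumes a hit costs the cache access time and a miss costs the main memory access time. The function name and the example timings are hypothetical.

```python
def effective_access_ns(hit_rate, cache_ns, main_ns):
    """Average access time when a fraction hit_rate of accesses hit the cache."""
    if not 0 <= hit_rate <= 1:
        raise ValueError("hit_rate must be in [0, 1]")
    return hit_rate * cache_ns + (1 - hit_rate) * main_ns
```

With an assumed 10 ns cache, a 100 ns main memory and a 90% hit rate, the model gives an average of roughly 19 ns, far closer to the cache speed than to the main memory speed.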
In common usage, the term “cache” refers to a hiding place. The name “cache memory” is an appropriate term for this high speed memory that is interposed between the processor and main memory because cache memory is hidden from the user or programmer, and thus appears to be transparent. Cache memory, serving as a fast storage buffer
Carbajal Christopher M.
Jenne John E.
Olarig Sompong P.
Compaq Information Technologies Group L.P.
Conley & Rose & Tayon P.C.
Le Dieu-Minh