Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability
Reexamination Certificate
1998-10-22
2002-07-09
Le, Dieu-Minh (Department: 2184)
C714S006130
active
06418539
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to reliable electronic systems. More particularly, but without limitation, the present invention relates to highly reliable computer disk drive memory systems, wherein reliability is obtained through the use of redundant components.
2. Description of Related Art
Various types of computer memory storage units are used in data processing systems. A typical system may include one or more disk drives (e.g., magnetic, optical or semiconductor) connected to the system's central processing unit (“CPU”) through respective control devices for storing and retrieving data as required by the CPU. A problem exists, however, if one of the subsystems within the storage unit fails such that information contained in the storage unit is no longer available to the system. Such a failure may shut down the entire data processing system.
The prior art has suggested several ways of solving the problem of providing reliable data storage. In systems where data records are relatively small, it is possible to use error correcting codes (“ECC”) which are appended to each data record within a storage unit. With such codes, it is possible to correct a small amount of data. However, such codes are generally not suitable for correcting or recreating long records which are in error, and provide no remedy at all for the complete failure of an entire disk drive, for example. Therefore, a need exists for providing data reliability external to individual disk drives.
Redundant disk array systems provide one solution to this problem. Various types of redundant disk array systems exist. In a paper entitled “A Case for Redundant Arrays of Inexpensive Disks (RAID)”, Proc. ACM SIGMOD, June 1988, Patterson et al. cataloged a number of different types of disk arrays and defined five distinct array architectures under the acronym “RAID,” for Redundant Array of Inexpensive Disks.
A RAID 1 architecture involves the use of duplicate sets of “mirrored” disk drives, i.e., keeping duplicate copies of all data on pairs of disk drives. While such a solution partially solves the reliability problem, it almost doubles the cost of data storage. Also, once one of the mirrored drives fails, the RAID 1 architecture can no longer withstand the failure of a second mirrored disk while still maintaining data availability. Consequently, upon the failure of a disk drive, the RAID 1 user is at risk of losing data.
Such systems as those described above have been designed to be easily serviceable, so as to help minimize the total amount of time required for detecting the failed disk drive, removing and replacing the failed drive, and copying data from the remaining functional disk to the replacement disk to again provide a redundant storage system. Nevertheless, in some circumstances where a customer detecting a failed disk drive must secure the assistance of a service engineer, the time elapsed from detection of the failure to complete data redundancy can be as long as twenty-four hours or more. During all this time, the user is exposed to the possibility of data loss if the sole remaining mirrored disk drive fails.
In an attempt to reduce this “window of vulnerability,” some manufacturers have equipped their storage array disk drive products with a spare disk drive. The spare disk drive is not used during normal operation. However, such systems are designed to automatically detect the failure of a disk drive and to automatically replace the failed disk drive with the spare disk drive. As a practical matter, replacement usually occurs by automatically turning off the failed drive and logically replacing the failed drive with the spare drive. For example, the spare drive may be caused to assume the logical bus address of the failed drive. Data from the functioning disk is then copied to the spare disk. Since this automatic failure detection and replacement process can typically be accomplished within a fairly short period of time (on the order of minutes), the window of vulnerability for automated systems is greatly reduced. Such techniques are known in the disk drive industry as “hot sparing.”
Immediately following the hot sparing process, the disk array system, although fault tolerant, can no longer sustain two disk failures while maintaining data availability. Therefore, the degree of fault tolerance of the system is compromised until such time as the customer or a service engineer physically removes the failed disk drive and replaces the failed disk drive with an operational disk drive.
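The hot sparing sequence described above can be illustrated with a minimal Python sketch. It is not taken from any product; the class name `MirroredArray`, the disk names, and the single-spare layout are assumptions made for illustration. It shows the spare taking over the failed drive's logical name, the copy from the surviving mirror, and the loss of redundancy once a second drive fails with no spare left.

```python
class MirroredArray:
    """Two mirrored disks plus one idle spare (all names are hypothetical)."""

    def __init__(self, blocks):
        # Both active disks hold identical copies of the data.
        self.active = {"disk0": list(blocks), "disk1": list(blocks)}
        self.spare_available = True

    def redundant(self):
        """True while two synchronized copies of the data exist."""
        return len(self.active) == 2

    def fail(self, name):
        """Drop a failed disk; if the spare is idle, rebuild the mirror onto it."""
        del self.active[name]
        if self.spare_available and self.active:
            survivor = next(iter(self.active.values()))
            # The spare assumes the failed drive's logical name and
            # receives a copy of the surviving disk's data.
            self.active[name] = list(survivor)
            self.spare_available = False


array = MirroredArray([b"a", b"b", b"c"])
array.fail("disk1")                  # spare is swapped in and resynchronized
after_first = array.redundant()
array.fail("disk0")                  # no spare remains
after_second = array.redundant()
data_still_readable = bool(array.active)
```

In this sketch the array is redundant again after the first failure, but the second failure leaves a single unprotected copy until a person replaces the failed hardware.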
As previously mentioned, in addition to RAID 1, there are also RAID levels 2-5. Although there are significant differences between the various RAID levels, each involves the technique of calculating and storing encoded redundancy values, such as Hamming codes and parity values, for a group of disks on a bit-per-disk basis. With the use of such redundancy values, a disk array can compensate for the failure of any single disk simply by reading the remaining functioning disks in the redundancy group and calculating what bit values would have to be stored on the failed disk to yield the correct redundancy values. Thus, N+1 RAID (where N=total number of disks containing data in a single redundancy group) can lose data only if there is a second disk failure in the group before the failed disk drive is replaced and the data from the failed drive recreated on the replacement disk.
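For the parity case, the reconstruction described above reduces to an exclusive-OR across the surviving disks. The following Python sketch assumes simple XOR parity over equal-length blocks in an N+1 group with N = 3; the function names and sample bytes are illustrative and do not come from the source.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks):
    """Recreate the one missing block from the surviving data and parity blocks."""
    return xor_blocks(surviving_blocks)


# N = 3 data disks plus one parity disk (sample values are arbitrary).
data = [b"\x0f\xf0", b"\x33\x55", b"\xaa\x01"]
parity = xor_blocks(data)

failed = 1  # index of the disk assumed to have failed
survivors = [blk for i, blk in enumerate(data) if i != failed] + [parity]
rebuilt = rebuild_missing(survivors)
```

Because XOR is its own inverse, combining the parity block with the surviving data blocks yields exactly the lost block. If a second disk in the group fails before the rebuild completes, there is no longer enough information to do so.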
Redundant disk storage increases overall data availability for data stored on the memory system. However, failure of other parts of the memory system can also compromise data availability. For example, failure of the power, cooling or controller subsystems forming part of the computer memory storage unit may cause stored data to become unavailable.
Redundant power systems are known wherein a single disk array is provided with two power supply subsystems, each being capable of powering the entire array. During normal operation, each power supply supplies one-half of the overall power requirements of the array. However, upon the failure of either power supply, the remaining power supply provides all power to the array until the failed power supply is replaced.
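The load-sharing arithmetic can be sketched in a few lines of Python. The 400 W figure and the function name `load_per_supply` are assumptions for illustration only; the point is that each supply must be rated for the full array load, even though it normally carries half.

```python
def load_per_supply(array_watts, supplies_ok):
    """Power each working supply must deliver when the load is shared equally."""
    working = sum(supplies_ok)
    if working == 0:
        raise RuntimeError("array is unpowered: no working supply")
    return array_watts / working


ARRAY_WATTS = 400.0  # assumed total load, not a figure from the source

normal = load_per_supply(ARRAY_WATTS, [True, True])     # both supplies healthy
degraded = load_per_supply(ARRAY_WATTS, [True, False])  # one supply failed
```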
Similarly, redundant cooling systems are also known. For example, two fans may normally cool the entire disk array system. Upon the failure of either fan, the rotational speed of the remaining fan increases so that the remaining fan maintains the temperature of the system within tolerable limits until such time as the defective fan is replaced.
Array Technology Corporation of Boulder, Colorado has offered dual controller RAID storage systems to the commercial market. Upon the failure of one of the dual RAID controllers, the host CPU can continue to access data stored on any disk of the array through the other controller. Thus, the Array Technology Corporation disk array system can tolerate the failure of a single controller without compromising data availability.
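The general idea of controller failover can be sketched as follows. This is a generic illustration and not a description of the Array Technology Corporation design; the `Controller` class, the `host_read` function, and the use of `ConnectionError` to signal a failed controller are all assumptions.

```python
class Controller:
    """Hypothetical RAID controller that can reach every disk in the array."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def read(self, disks, disk_id):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is not responding")
        return disks[disk_id]


def host_read(controllers, disks, disk_id):
    """Try each controller in turn; return (controller name, data)."""
    for controller in controllers:
        try:
            return controller.name, controller.read(disks, disk_id)
        except ConnectionError:
            continue  # fall through to the next controller
    raise RuntimeError("no working controller: data unavailable")


disks = {0: b"payload"}
primary = Controller("ctrl-A", healthy=False)  # simulated failure
secondary = Controller("ctrl-B")
served_by, data = host_read([primary, secondary], disks, 0)
```

The data stays available as long as at least one controller responds; only the loss of both makes the read fail.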
During recent years, the cost of the physical components for disk drive systems has been decreasing. However, the cost of labor, and in particular the cost for service, is increasing and can be expected to continue to increase. In fact, over the commercially useful life of a disk drive system (typically about 5-10 years), service costs can be expected to meet or exceed the initial purchase price of the system.
Many highly available redundant disk storage systems are designed such that the components which are subject to failure can be easily removed and replaced, either by the customer or a field engineer. Unfortunately, however, building a disk storage system wherein components are serviceable significantly increases the design and manufacturing costs and hence the cost to the customer. For example, serviceable components must be built with more expensive blind-mateable plugs and sockets for electrically interconnecting parts where the connectors are not easily accessible. For highest availability, the overall system must be designed to allow removal and installation of su
Compaq Computer Corporation
Fenwick & West LLP
Le Dieu-Minh