Hot spare reliability for storage arrays and storage networks

Electrical computers and digital processing systems: memory – Storage accessing and control – Specific memory composition

Reexamination Certificate

Details

C714S006130, C714S006130

active

06732233

ABSTRACT:

FIELD OF THE INVENTION
The present invention relates generally to disk drive systems, and more specifically to an apparatus and method for improving the reliability of hot spare disk drives in a storage array/storage network.
BACKGROUND OF THE INVENTION
RAID (Redundant Array of Independent Disks) is a set of methods and algorithms for combining multiple disk drives (i.e., a storage array) into a group whose attributes are better than those of the individual disk drives. RAID can be used to improve data integrity (i.e., reduce the risk of losing data due to a defective or failing disk drive), cost, and/or performance.
RAID was initially developed to improve I/O performance at a time when computer CPU speed and memory size were growing exponentially. The basic idea was to combine several small inexpensive disks (with many spindles) and stripe the data (i.e., split the data across multiple drives), such that reads or writes could be done in parallel. To simplify the I/O management, a dedicated controller would be used to facilitate the striping and present these multiple drives to the host computer as one logical drive.
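As an illustration of the striping idea described above, the following minimal sketch maps a logical block number onto a drive and an offset within that drive. The function name, the round-robin layout, and the parameters `num_drives` and `blocks_per_stripe` are assumptions made for this example; actual controllers may lay out stripes differently.

```python
from typing import Tuple


def locate_block(logical_block: int, num_drives: int,
                 blocks_per_stripe: int) -> Tuple[int, int]:
    """Map a logical block to (drive index, block offset on that drive).

    Assumes a plain round-robin layout with no parity or mirroring.
    """
    stripe_index = logical_block // blocks_per_stripe    # which stripe unit overall
    offset_in_stripe = logical_block % blocks_per_stripe
    drive = stripe_index % num_drives                     # stripe units rotate across drives
    stripes_before = stripe_index // num_drives           # stripe units already on that drive
    return drive, stripes_before * blocks_per_stripe + offset_in_stripe


# Consecutive stripe units land on different drives, so they can be
# read or written in parallel.
for block in (0, 8, 16, 24, 32):
    print(block, locate_block(block, num_drives=4, blocks_per_stripe=8))
```

Because neighbouring stripe units fall on different drives, a large sequential transfer is spread over all spindles, which is the source of the performance benefit.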
The problem with this approach was that the small, inexpensive PC disk drives of the time were far less reliable than the larger, more expensive drives they replaced. An artifact of striping data over multiple drives is that if one drive fails, all data on the other drives is rendered unusable. To compound this problem, by combining several drives together, the probability of one drive out of the group failing increased dramatically.
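The growth in failure probability can be made concrete with a short calculation. The sketch below assumes that drives fail independently and that each has the same probability `p_single` of failing within some period of interest; both are simplifying assumptions for illustration, not figures from the source.

```python
def prob_any_failure(p_single: float, num_drives: int) -> float:
    """Probability that at least one of num_drives drives fails.

    Assumes independent failures with identical per-drive probability.
    """
    return 1.0 - (1.0 - p_single) ** num_drives


# With a hypothetical 3% per-drive failure probability, a five-drive
# striped group has roughly a 14% chance of losing at least one drive.
print(round(prob_any_failure(0.03, 5), 4))
```

In a striped group without redundancy, any single failure makes the whole logical drive unusable, so this figure is also the probability of losing the array.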
In order to overcome this pitfall, extra drives were added to the RAID group to store redundant information. In this way, if one drive failed, another drive within the group would contain the missing information, which could then be used to regenerate the lost information. Since all of the information was still available, the end user would never be impacted with down time and the rebuild could be done in the background. If users requested information that had not already been rebuilt, the data could be reconstructed on the fly and the end user would not know about it.
Today there are six base architectures (levels) of RAID, ranging from “Level 0 RAID” to “Level 5 RAID”. These levels provide alternative ways of achieving storage fault tolerance, increased I/O performance and true scalability. Three main building blocks are used in all RAID architectures:
1) Data Striping: Data from the host computer is broken up into smaller chunks and distributed to multiple drives within a RAID array. Each drive's storage space is partitioned into stripes. The stripes are interleaved such that the logical storage unit is made up of alternating stripes from each drive. Major benefits are improved I/O performance and the ability to create large logical volumes. Data striping is used in Level 0 RAID.
2) Mirroring: Data from the host computer is duplicated on a block-to-block basis across two disks. If one disk drive fails, the data remains available on the other disk. Mirroring is used in RAID levels 1 and 1+0.
3) Parity: Data from the host computer is written to multiple drives. One or more drives are assigned to store parity information. In the event of a disk failure, parity information is combined with the remaining data to regenerate the missing information. Parity is used in RAID levels 3, 4 and 5.
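The parity building block can be illustrated with byte-wise XOR, which is the usual parity function in RAID levels 3, 4 and 5. The text above does not name the function, so XOR is stated here as an assumption of the sketch, and the helper name `xor_blocks` and the sample data are hypothetical.

```python
from functools import reduce
from typing import Iterable


def xor_blocks(blocks: Iterable[bytes]) -> bytes:
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks, one per data drive, plus a parity block.
data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = xor_blocks(data)

# Suppose the drive holding data[1] fails: XOR of the surviving data
# blocks and the parity block regenerates the missing block.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == data[1])
```

The same operation serves both to compute parity and to rebuild a lost block, because XOR-ing a value twice cancels it out.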
If a drive fails in a RAID array that includes redundancy (that is, in any RAID architecture other than RAID 0), it is desirable to get the drive replaced immediately so the array can be returned to normal operation. There are two reasons for this: fault tolerance and performance. If the array is running in a degraded mode due to a drive failure, then until the drive is replaced, most RAID levels will be running with no fault protection at all: a RAID 1 array is reduced to a single drive, and a RAID 3 or RAID 5 array becomes equivalent to a RAID 0 array in terms of fault tolerance. At the same time, the performance of the array will be reduced, sometimes substantially.
An extremely useful RAID feature that helps alleviate this problem is hot swapping, which, when properly implemented, lets a user replace the failed drive immediately without taking down the system. Another approach is the use of hot spares. Additional drives are attached to the controller and left in a “standby” mode. If a failure occurs, the controller can use the spare drive as a replacement for the bad drive. This simple concept is supported by most RAID implementations. Even many of the inexpensive hardware RAID cards and software RAID solutions support this approach. Typically, the only cost is another hard disk that has to be purchased but cannot be used for storing data.
The main advantage that hot sparing has over hot swapping is that with a controller that supports hot sparing, the rebuild will be automatic. The controller detects that a drive has gone bad, disables it, and immediately rebuilds the data onto the hot spare. This is a tremendous advantage for anyone managing many arrays, or for systems that run unattended. As features, hot sparing and hot swapping are independent: you can have one, or the other, or both. They will work together, and often are used in that way. Sparing is particularly important if hot swap (or warm swap) capability is not available, because it enables a user to get the array back into normal operating mode quickly, delaying the time that the system has to be shut down until it is more convenient. When this occurs, however, the user loses the hot sparing capability in the meantime. When the failed drive is eventually replaced, the new drive becomes the new hot spare.
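The automatic rebuild described above can be sketched with a toy two-way mirror that has one hot spare. The class `MirroredArray`, the drive names, and the in-memory lists standing in for disks are all hypothetical; a real controller would rebuild in the background and handle I/O arriving during the rebuild.

```python
from typing import Dict, List


class MirroredArray:
    """Toy RAID 1 pair with one hot spare (illustrative only)."""

    def __init__(self, num_blocks: int) -> None:
        self.active: Dict[str, List[int]] = {
            "d0": [0] * num_blocks,
            "d1": [0] * num_blocks,
        }
        self.spares: Dict[str, List[int]] = {"s0": [0] * num_blocks}

    def write(self, block: int, value: int) -> None:
        # Mirroring: every write goes to all active drives.
        for drive in self.active.values():
            drive[block] = value

    def fail(self, name: str) -> str:
        """Disable a failed drive and rebuild onto a spare; return the spare's name."""
        del self.active[name]                        # disable the bad drive
        if not self.spares:
            raise RuntimeError("no hot spare available; array stays degraded")
        survivor = next(iter(self.active.values()))
        spare_name, spare = self.spares.popitem()
        spare[:] = survivor                          # rebuild from the surviving mirror
        self.active[spare_name] = spare              # spare is now an active drive
        return spare_name


array = MirroredArray(num_blocks=4)
array.write(2, 7)
print(array.fail("d0"), array.active["s0"])
```

After the failover the array is fully redundant again, but the spare pool is empty until the failed drive is physically replaced, which matches the trade-off noted above.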
It has been discovered that hot spares can fail shortly after they are called into action. This appears to be caused by the distinctly different access patterns used for spares as compared to active drives. Thus, when a spare drive begins using areas of the disk stack not previously accessed during its idle routine, a head crash may result.
In order to address this problem, manufacturers of RAID systems have implemented a “safe mode” for priming the hot spare before it is called into service. In safe mode, when the hot spares are in idle mode, the heads are moved across the disk surfaces at some periodic interval. The problem with safe mode is that it does not emulate the track utilization present in the active drives of an array.
There is a need for a more realistic apparatus and method to prime a hot spare drive prior to use. The apparatus and method should cause the hot spare to emulate the track utilization present in the active drives of the array before the spare is called into service.
The details of the present invention, both as to its structure and operation, can best be understood in reference to the accompanying drawings, in which like reference numerals refer to like parts.
SUMMARY
A method and apparatus for improving the reliability of hot spare disk drives in disk arrays and storage networks are provided. In a preferred embodiment, the present invention intercepts commands issued to active disk drives in a disk array/storage network, analyzes the commands, and issues commands to the hot spare drives that attempt to emulate the track usage patterns of the active drives. The track usage patterns can be inferred from examining logical block addresses (LBA) of data stored to the active drives, and/or the ratio of read versus write commands issued to the active disk drives. By emulating track usage patterns of the active drives, the hot spare drives have a roughly equivalent lubricant distribution to that of the active disk drives. This provides increased reliability when the hot spare drives are later called into active service.
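One way to picture the intercept, analyze and emulate sequence summarized above is the following minimal sketch. It is not the patented implementation: the class `UsageEmulator`, the division of the LBA space into equal regions as a stand-in for track usage, and the random sampling scheme are all assumptions made for illustration.

```python
import random
from collections import Counter
from typing import Tuple


class UsageEmulator:
    """Hypothetical sketch: learn LBA-region and read/write statistics from
    commands sent to active drives, then generate similar commands for a spare."""

    def __init__(self, total_lbas: int, num_regions: int = 8, seed: int = 0) -> None:
        self.total_lbas = total_lbas
        self.region_size = -(-total_lbas // num_regions)   # ceiling division
        self.region_hits: Counter = Counter()
        self.ops: Counter = Counter()
        self.rng = random.Random(seed)

    def observe(self, op: str, lba: int) -> None:
        """Record one intercepted command ('read' or 'write') and its LBA."""
        self.region_hits[lba // self.region_size] += 1
        self.ops[op] += 1

    def next_spare_command(self) -> Tuple[str, int]:
        """Draw a command for the spare that follows the observed statistics."""
        if not self.region_hits:
            raise ValueError("no commands observed yet")
        regions = list(self.region_hits)
        region = self.rng.choices(
            regions, weights=[self.region_hits[r] for r in regions])[0]
        ops = list(self.ops)
        op = self.rng.choices(ops, weights=[self.ops[o] for o in ops])[0]
        start = region * self.region_size
        end = min(start + self.region_size, self.total_lbas)
        return op, self.rng.randrange(start, end)


emulator = UsageEmulator(total_lbas=1000, num_regions=10)
for op, lba in [("read", 5), ("read", 15), ("write", 20), ("read", 950)]:
    emulator.observe(op, lba)
print([emulator.next_spare_command() for _ in range(3)])
```

Regions that the active drives touch more often are sampled more often for the spare, and the read/write mix follows the observed ratio. Because the mapping from LBA to physical track is drive specific, the regions here are only a coarse proxy for track usage.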
The present invention also provides a storage array/storage network apparatus having a plurality of active disk drives for storing data. The storage array/storage network apparatus also provides one or more hot spare disk drives for replacing any of the active disk drives if a failure occurs on any of the active di
