Failure detection and isolation

Error detection/correction and fault detection/recovery – Pulse or data error handling – Error count or rate

Reexamination Certificate

Details

C714S057000, C714S006130

active

06430714

ABSTRACT:

TECHNICAL FIELD
This invention relates to the detection of equipment failures on a disk array connected over a loop such as a Fibre Channel loop.
BACKGROUND OF THE INVENTION
Subsystems comprising disk arrays, i.e., groups of small, independent disk drive modules used to store large quantities of data, have been developed and found to possess many advantages over a single large disk drive. For example, the individual modules of a disk array typically take up very little space, use less power, and cost less than a single large disk drive, yet, when grouped together in an array, provide the same data storage capacity as a single large disk drive. In addition, the small disks of an array retrieve data more quickly than does a single large disk drive because, with a small disk drive, there is less distance for the actuator to travel and less data per individual disk to search through. The greatest advantage of small disk drives, however, is the boost they give to I/O performance when configured as a disk array subsystem.
On a disk array system, a failure in any one disk drive in the array will require identifying which of the many disk drives in the array was the cause of the problem. With the advent of communication loops for connecting the disk drives of an array, the ability to identify and remove a faulty drive becomes particularly desirable, as communications are passed through the receiver and transmitter of each disk drive on the loop. Arbitrated loop protocols such as Fibre Channel are becoming popular for providing high speed communications in a disk array. A difficulty one may run into on a Fibre Channel disk array system is that while the standard failure identification is by target, in this case the logical unit (LUN) that received the I/O request from the host, any non-target drive in the Fibre Channel loop may have actually caused the error. More generally, in sending a data word from a host to one of the disks on the loop, the word must pass through receivers and transmitters in each of the disk drives electrically between the host and the target disk drive on the loop. If an error in the word is caused by any of the receivers or transmitters along the way, an error is reported by the target disk drive. While the system is aware of the error, it typically is not able to determine which of the disk drives on the loop was the cause of the error. Trial and error diagnostics then need to be implemented in order to locate the faulty equipment.
SUMMARY OF THE INVENTION
Requests are made to each of the disk drives on a loop of disk drives for a count of errors so that an increase in the number of errors may be detected and reported. Detection of an invalid transmission word can take place at intermediate disk drives between an initiator sending the data word and the target drive. As such, detection of occurrences of an invalid transmission word can be used to identify faulty equipment, either receivers or transmitters, in disk drives that are located on a loop.
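The count comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `drives_with_new_errors`, the drive IDs, and the dictionary representation of the reported counts are hypothetical and do not come from the patent, and how the counts are actually requested over the loop is not modeled.

```python
from typing import Dict, List


def drives_with_new_errors(baseline: Dict[str, int],
                           current: Dict[str, int]) -> List[str]:
    """Return drives whose invalid-transmission-word count rose since baseline.

    Both arguments map a (hypothetical) drive ID to the error count that
    the drive reported. Drives missing from the baseline are skipped,
    because there is no earlier count to compare against.
    """
    return [drive for drive, count in current.items()
            if drive in baseline and count > baseline[drive]]


# Illustrative counts only, not real measurements.
baseline = {"drive0": 3, "drive1": 0, "drive2": 7}
current = {"drive0": 3, "drive1": 2, "drive2": 7}
suspects = drives_with_new_errors(baseline, current)
```

With these sample counts only `drive1` shows an increase, so it is the only drive reported.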
A loop of disk drives, such as a Fibre Channel loop, typically permits disk drives to initiate a loop initialization protocol (LIP). Loop initialization protocols are typically initiated upon adding a disk drive to a loop, upon power up or for error recovery. To assist in properly identifying failed equipment on a loop of disk drives, a count of LIPs initiated and received by each disk drive is requested from the disk drives on the loop. The occurrences of LIP initiations and LIP receptions are synchronized with disk drive error requests and compared to identify disparities indicative of a failure on the loop. Also, any initiation of a certain type of LIP, which we shall refer to as an “error-indicating LIP”, is indicative of a failure in a disk drive or possibly its electrical predecessor. In accordance with a particular embodiment of the invention, initiation of any LIP by a disk drive is indicative of a failure in that disk drive or its electrical predecessor on the loop. Furthermore, when LIP receptions are identified at disk drives on a loop but no corresponding LIP initiation is identified, the failure might lie not in a disk drive but in other equipment on the loop, such as a host bus adapter.
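The comparison of LIP initiations against LIP receptions might look like the following Python sketch. It follows the particular embodiment in which any LIP initiation by a drive is treated as suspicious. The `LipCounts` type, the function name, and the three result labels are assumptions made for illustration; the patent does not define them.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class LipCounts:
    """LIP counters reported by one drive (hypothetical representation)."""
    initiated: int
    received: int


def classify_lip_activity(baseline: Dict[str, LipCounts],
                          current: Dict[str, LipCounts]) -> Tuple[str, List[str]]:
    """Compare two polls of LIP counters and label the likely fault area."""
    initiators: List[str] = []
    receivers: List[str] = []
    for drive, now in current.items():
        before = baseline.get(drive)
        if before is None:
            continue  # no earlier poll for this drive, nothing to compare
        if now.initiated > before.initiated:
            initiators.append(drive)
        if now.received > before.received:
            receivers.append(drive)
    if initiators:
        # A drive started a LIP: suspect that drive or its predecessor.
        return "drive_or_predecessor", initiators
    if receivers:
        # LIPs were seen but no drive started one: suspect other
        # equipment on the loop, such as a host bus adapter.
        return "non_drive_equipment", receivers
    return "no_new_lips", []
```

The second branch captures the disparity case: receptions rose somewhere on the loop while no drive reports having initiated a LIP.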
In accordance with an embodiment of the invention, the error count may include both the number of invalid transmission words and the number of loop initialization protocols initiated and received. All such counts may be requested over the Fibre Channel loop from each disk drive on the loop. The baseline count is obtained in a first request. A second request for the counts permits the detection of changes in the counts on the disk drives in the loop. If no LIPs have occurred, the change in error count is used to identify a suspect disk drive. Also, the electrical predecessor on the loop is recorded since an error may have been caused by the transmitter of the predecessor or the receiver of the error detecting disk drive. When LIPs are detected, they are used to help locate the source of the errors. The methods of the present invention may be embodied on a computer program product for use on a computer system.
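As a rough sketch of the no-LIP case, the Python below pairs each drive whose error count rose with its electrical predecessor. It assumes the loop order is known as a list of node IDs in transmission order; that list, the node names, and the function name `suspect_pairs` are hypothetical and not taken from the patent.

```python
from typing import Dict, List, Tuple


def suspect_pairs(loop_order: List[str],
                  baseline: Dict[str, int],
                  current: Dict[str, int]) -> List[Tuple[str, str]]:
    """Return (detecting node, electrical predecessor) for each count increase.

    loop_order lists every node on the loop in transmission order. The
    loop is closed, so the predecessor of the first entry is the last.
    """
    pairs: List[Tuple[str, str]] = []
    for position, node in enumerate(loop_order):
        if node not in baseline or node not in current:
            continue  # node reported no counts in one of the two polls
        if current[node] > baseline[node]:
            # Index -1 wraps to the last entry, which closes the loop.
            predecessor = loop_order[position - 1]
            pairs.append((node, predecessor))
    return pairs


# Illustrative loop: a host bus adapter followed by three drives.
loop = ["hba", "drive0", "drive1", "drive2"]
first_poll = {"drive0": 0, "drive1": 5, "drive2": 1}
second_poll = {"drive0": 2, "drive1": 5, "drive2": 4}
pairs = suspect_pairs(loop, first_poll, second_poll)
```

Each pair names both candidates because the error may come from the predecessor's transmitter or from the detecting drive's receiver.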
Embodiments of the invention advantageously achieve early and quick detection of failed equipment on the loop. The LIP counts may advantageously identify a non-disk drive error and thus save the time and effort of performing a trial and error diagnostic at each of the disk drives in the loop. Furthermore, by making the initiation of any LIP indicative of an equipment failure in a disk drive or its electrical predecessor, earlier detection of failed equipment is made possible.
Other objects and advantages of the invention will become apparent during the following description of the presently preferred embodiments of the invention taken in conjunction with the drawings.


REFERENCES:
patent: 3622984 (1971-11-01), Eastman
patent: 3704363 (1972-11-01), Salmassy et al.
patent: 5638518 (1997-06-01), Malladi
patent: 5666512 (1997-09-01), Nelson et al.
patent: 5802080 (1998-09-01), Westby
patent: 5890214 (1999-03-01), Espy et al.
“Fibre Channel and Related Standards” Martin Sachs IEEE Communications Magazine Aug. 1996 pp. 40-50.
