Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability
Reexamination Certificate
1999-12-03
2002-11-26
Beausoleil, Robert (Department: 2184)
C714S042000, C714S055000
active
06487680
ABSTRACT:
FIELD OF THE INVENTION
This invention relates generally to data storage systems. More particularly, the invention relates to the management of a data storage system by multiple disk array controllers in an n-way active configuration, such that a disk array controller can detect the failure of and reset one or more other disk array controllers in the data storage system.
BACKGROUND OF THE INVENTION
Disk drives in all computer systems are susceptible to failures caused, for example, by temperature variations, head crashes, motor failure, controller failure, and changing supply voltage conditions. Modern computer systems typically require, or at least benefit from, a fault-tolerant data storage system, for protecting data in the data storage system against any instances of data storage system component failure. One approach to meeting this need is to provide a redundant array of independent disks (RAID) operated by a disk array controller (controller).
A RAID system typically includes a single standalone controller, or multiple independent controllers, wherein each controller operates independently with respect to the other controllers. A controller is generally coupled across one or more input/output (I/O) buses both to a rack of disk drives and also to one or more host computers. The controller processes I/O requests from the one or more host computers to the rack of disk drives. Such I/O requests include, for example, Small Computer System Interface (SCSI) I/O requests, which are known in the art.
Such a RAID system provides fault tolerance to the one or more host computers, at a disk drive level. In other words, if one or more disk drives fail, the controller can typically rebuild any data from the one or more failed disk drives onto any surviving disk drives. In this manner, the RAID system handles most disk drive failures without interrupting any host computer I/O requests.
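As an illustration of such a rebuild, the following is a minimal sketch that assumes a single-parity scheme in which one drive stores the byte-wise XOR of the data drives. The text above does not name a RAID level, so the parity scheme, the function name `xor_blocks`, and the sample data are assumptions made only for this example.

```python
from functools import reduce
from typing import List


def xor_blocks(blocks: List[bytes]) -> bytes:
    """Return the byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical contents of three data drives (two bytes each).
data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]

# The parity drive holds the XOR of all data drives.
parity = xor_blocks(data)

# Assume the second drive fails: XOR the survivors with the parity block.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == data[1])  # True
```

Under this assumption only one failed drive per parity group can be recovered; tolerating more simultaneous failures would require additional redundancy.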
Consider what would happen if a controller in a single controller system failed: the entire data storage system would become inoperable. And, although failure of a single controller in a data storage system that is being managed by multiple independent controllers will not typically render the entire RAID system inoperable, such a failure will halt the tasks that were being performed by the failed controller, and/or those tasks scheduled to be performed by the failed controller. In light of the above, it can be appreciated that it is not only desirable for a data storage system to function reliably when a disk drive failure occurs, but also desirable for the data storage system to function reliably with any type of failed component, including a failed controller.
To provide fault tolerance to a data storage system at a controller level, data storage systems managed by two controllers in dual active configuration were implemented. Referring to FIG. 1, there is shown data storage system 100 being managed by two controllers 102 and 104 in dual active configuration, according to the state-of-the-art. Controllers 102 and 104 are coupled across first peripheral bus 106, for example, an optical fiber, copper coax cable, or twisted pair (wire) bus, to a plurality of storage devices, for example, disk drives 108-112, in peripheral 114. Controllers 102 and 104 are also coupled across a second peripheral bus 116, for example, an optical fiber, copper coax cable, or twisted pair (wire) bus, to one or more host computers, for example, host computer 118.
From the viewpoint of controller 102, controller 104 is its partner controller, and from the viewpoint of controller 104, controller 102 is its partner controller. To determine when a partner controller has failed, controllers 102 and 104 are connected across ping cable 120. Each respective controller 102 and 104 is responsible for sending ping messages to the other controller 102 or 104 across ping cable 120.
Receipt of a ping message by a controller 102 or 104 from a partner controller 102 or 104 informs the receiving controller 102 or 104 that the partner controller 102 or 104 is alive, and not malfunctioning from a hardware problem or another problem. For example, when a particular controller 102 or 104 stops receiving ping messages from its partner controller 102 or 104 for a predetermined amount of time, the particular controller 102 or 104 determines that the partner controller 102 or 104, in some manner, has failed.
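The timeout rule described above can be sketched as follows. This is only an illustrative model, not the controllers' firmware: the class name `PartnerMonitor`, the 3-second timeout, and the use of plain numeric timestamps are all assumptions introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class PartnerMonitor:
    """Tracks ping messages received from a single partner controller."""
    timeout_s: float          # assumed value for the "predetermined amount of time"
    last_ping_s: float = 0.0  # time at which the most recent ping arrived

    def record_ping(self, now_s: float) -> None:
        """Note that a ping message arrived over the ping cable."""
        self.last_ping_s = now_s

    def partner_failed(self, now_s: float) -> bool:
        """Declare failure once no ping has arrived within the timeout."""
        return (now_s - self.last_ping_s) > self.timeout_s


monitor = PartnerMonitor(timeout_s=3.0)
monitor.record_ping(now_s=10.0)
print(monitor.partner_failed(now_s=12.0))  # False: 2.0 s of silence
print(monitor.partner_failed(now_s=13.5))  # True: 3.5 s of silence
```

Note that the monitor holds state for exactly one partner, which mirrors the single hardwired ping cable 120 of the dual active configuration.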
In the event that a controller 102 or 104 fails, the surviving controller 102 or 104 will take over the tasks that were being performed by the failed controller 102 or 104. Additionally, the surviving controller 102 or 104 will perform those tasks that were scheduled to be performed by the failed controller 102 or 104. Additionally, if the failure is of a type for which reset is an adequate solution, the surviving controller 102 or 104 will typically attempt to reset the failed controller 102 or 104 by sending it a reset signal across a reset line 122. Such reset signals are known. (It can be appreciated that, in some instances, the failed controller 102 or 104 may require replacement or repair so that a reset by a surviving controller 102 or 104 may be inadequate.)
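The takeover and reset sequence described above might be modeled as in the sketch below. The `Controller` class, the task names, and the `reset_is_adequate` flag are hypothetical; in particular, the sketch assumes that a delivered reset signal always revives the failed controller, which, as noted above, is not guaranteed in practice.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Controller:
    name: str
    active_tasks: List[str] = field(default_factory=list)
    scheduled_tasks: List[str] = field(default_factory=list)
    failed: bool = False
    reset_count: int = 0

    def receive_reset(self) -> None:
        """Model a reset signal arriving over the reset line (assumed to succeed)."""
        self.reset_count += 1
        self.failed = False


def take_over(survivor: Controller, failed: Controller, reset_is_adequate: bool) -> None:
    """Move the failed controller's work to the survivor, then optionally reset it."""
    survivor.active_tasks.extend(failed.active_tasks)
    survivor.scheduled_tasks.extend(failed.scheduled_tasks)
    failed.active_tasks.clear()
    failed.scheduled_tasks.clear()
    if reset_is_adequate:
        failed.receive_reset()


c102 = Controller("102", active_tasks=["rebuild"], scheduled_tasks=["scrub"], failed=True)
c104 = Controller("104", active_tasks=["host-io"])
take_over(survivor=c104, failed=c102, reset_is_adequate=True)
print(c104.active_tasks)     # ['host-io', 'rebuild']
print(c104.scheduled_tasks)  # ['scrub']
print(c102.reset_count)      # 1
```

When `reset_is_adequate` is false the failed controller is left marked as failed, corresponding to the case in which replacement or repair is needed.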
Consider that the failure of both controllers 102 and 104 would destroy the fault tolerance and functionality of data storage system 100. It would be advantageous and desirable to manage a data storage system with more than two controllers (as in the above described dual active controller configuration), such that at least two controllers could fail before such fault tolerance and functionality of a data storage system is destroyed.
A significant problem with the state of the art is that it does not provide any system, structure or method for a controller 102 or 104 to detect the failure of, or reset, any controller 102 or 104 other than a single partner controller 102 or 104. To illustrate this, consider that ping cable 120 and reset line 122 are hardwired between controllers 102 and 104, such that respective controllers 102 and 104 can only detect the failure of and reset a partner controller 102 or 104.
For more than two controllers to manage a data storage system in active controller configuration, each respective controller would require an ability to detect and reset more than just a single other controller. According to state of the art methodologies for detecting the failure of a partner controller 102 or 104, such a controller 102 or 104 would need to be implemented to accommodate more than just one respective ping cable and reset line to detect any failures and reset more than just a single other controller in the data storage system. The design and implementation of such a backplane would typically add additional expense to the cost of a controller and a data storage system. Additionally, significant manual intervention, by a human system administrator, may be required to add and connect such ping cables and reset lines between the controllers, possibly even necessitating the system to be shut-down during such intervention.
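To give a rough sense of how this wiring burden grows, the sketch below counts the dedicated links needed if every pair of controllers were joined by its own ping cable and its own reset line. That every-pair wiring is an assumption made for illustration (the text above says only that more than one cable and line per controller would be needed), and the function name `pairwise_links` is hypothetical.

```python
def pairwise_links(n_controllers: int) -> int:
    """Number of distinct controller pairs, i.e. n * (n - 1) / 2."""
    if n_controllers < 0:
        raise ValueError("controller count must be non-negative")
    return n_controllers * (n_controllers - 1) // 2


for n in (2, 3, 4, 8):
    links = pairwise_links(n)  # ping cables needed (and as many reset lines)
    ports = n - 1              # ping connections each controller must accept
    print(f"{n} controllers: {links} ping cables, {ports} ping ports per controller")
```

For two controllers the count is one, matching the single ping cable 120 and reset line 122 of FIG. 1; under the same assumption eight controllers would need 28 of each.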
Therefore, there is a need for a data storage system that is managed by more than just two controllers in active controller configuration. There is a need for each controller in such a data storage system to be able to detect the failure of and reset more than just a single other partner controller in the data storage system. To accomplish this, it is desirable that such a controller will not require a redesign of the controller's backplane to accommodate an arbitrary number of ping cables and reset lines.
Otterness Noel S.
Skazinski Joseph G.
Beausoleil Robert
International Business Machines - Corporation
Wilson Yolanda