Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability
Reexamination Certificate
2001-05-22
2004-07-06
Beausoliel, Robert (Department: 2184)
C714S016000
active
06760862
ABSTRACT:
BACKGROUND OF THE INVENTION
A typical data storage system stores and retrieves data for one or more external host devices (or simply hosts). Such a data storage system typically includes processing circuitry and a set of disk drives. In general, the processing circuitry performs load and store operations on the set of disk drives on behalf of the hosts, e.g., block I/O operations using SCSI communications, ESCON communications, Fibre Channel signals, etc.
On occasion, the data storage system may require servicing by a technician. To this end, the technician typically goes to the location where the data storage system resides, and performs a service procedure on the data storage system. For example, the system may require a hardware or software upgrade in order to integrate a design improvement or to fix a design defect. As another example, a circuit board of the processing circuitry or a disk drive may fail and require replacement.
To assist the technician in performing such service procedures, some data storage system manufacturers provide scripts that automate the servicing process. That is, in response to a few electronically entered commands (e.g., instructions typed into a data storage system console device by the technician), the scripts perform a more-detailed and more-complex series of operations. As a result, without extensive knowledge of low-level aspects of the data storage system, the technician can perform a variety of service operations on the data storage system such as upgrading hardware or software, or replacing a defective data storage part by simply providing a few commands (e.g., typing information at a keyboard) and performing some physical work (swapping a failed component with a new component).
For example, suppose that a disk drive of a data storage system fails. A technician can travel to the data storage system and, at the console device of the data storage system, run a conventional script that guides the technician through a disk drive replacement procedure in an automated manner. For one conventional type of data storage system, the script first requires the technician to identify a spare disk drive for use in recovering data on the failed disk drive. After the technician identifies the spare disk drive, the script performs a data recovery procedure to recover the data. Such a recovery procedure may simply involve copying data from a mirror disk drive to the spare disk drive or, alternatively, involve more extensive data recovery operations (e.g., performing a series of logical XOR operations to recover data from related data and parity information). After the data is restored onto the spare disk drive, the script directs the technician to physically remove the failed disk drive and replace it with a new disk drive. After the technician physically replaces the failed disk drive with the new disk drive, the script checks the new disk drive to make sure it has an appropriate size (e.g., that the new disk drive is at least as large as the failed disk drive). Next, the script copies the recovered data from the spare disk drive to the new disk drive. Once the data resides on the new disk drive, the script gives back the spare disk drive so that it can be used for other purposes, and the disk drive replacement process is complete.
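The XOR-based recovery mentioned above can be illustrated with a minimal sketch. It assumes a simple single-parity layout in which the parity block is the byte-wise XOR of all data blocks. The function names `xor_blocks` and `recover_lost_block` are hypothetical and do not come from the patent or from any particular storage product.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equal-length blocks."""
    if not blocks:
        raise ValueError("need at least one block")
    length = len(blocks[0])
    if any(len(block) != length for block in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) groups the bytes found at the same offset in every block.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def recover_lost_block(surviving_blocks, parity_block):
    """Rebuild the one missing data block from the survivors and the parity."""
    return xor_blocks(list(surviving_blocks) + [parity_block])


# Three data blocks; the parity block is computed while all drives are healthy.
d0 = b"\x01\x02\x03\x04"
d1 = b"\x10\x20\x30\x40"
d2 = b"\xaa\xbb\xcc\xdd"
parity = xor_blocks([d0, d1, d2])

# The drive holding d1 fails; its contents are rebuilt from d0, d2 and parity.
recovered = recover_lost_block([d0, d2], parity)
```

Because XOR is its own inverse, combining the parity block with every surviving data block cancels their contributions and leaves only the missing block.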
The technician can perform other types of service procedures using other conventional scripts that automate those service procedures in a manner similar to that described above for replacing a disk drive. Other examples of conventional script-driven service procedures include those for upgrading hardware or replacing failed hardware (e.g., circuit boards, etc.) and those for upgrading software (e.g., operating systems, device drivers, application level programs, etc.).
SUMMARY OF THE INVENTION
Unfortunately, there are deficiencies to using the above-described conventional scripts that automate servicing processes. For example, such scripts typically expect a service procedure to complete successfully, or if stopped before completion, to be restarted from the beginning. However, many conventional service procedures can fail in the middle, leaving the data storage system in an intermediate state. When in such a state, the service procedure may not work properly if restarted because the service procedure may need certain parameters of the data storage system to be at certain values which have since changed to values that will cause the service procedure to operate improperly.
For example, suppose that a technician travels to a customer site to replace a bad disk drive of a data storage system. Upon arrival suppose that the technician boots the console device of the data storage system and invokes a disk drive replacement script which is designed to enable the technician to (i) allocate an available spare disk drive and recover data onto the spare disk drive (e.g., copy data from a disk drive that mirrors the failed disk drive), (ii) replace the failed disk drive with a new disk drive, (iii) subsequently transfer the recovered data from the spare disk drive to the new disk drive, and (iv) finally return the spare disk drive to its initially available condition.
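The four steps (i) through (iv) can be sketched as one linear routine. This is only an illustrative model of the conventional script on its successful path. The `Drive` and `StorageSystem` classes, their fields, and `replace_failed_drive` are hypothetical names introduced here, and the mirror copy stands in for whatever recovery method a real system would use.

```python
from dataclasses import dataclass, field


@dataclass
class Drive:
    name: str
    size_gb: int
    data: bytes = b""


@dataclass
class StorageSystem:
    available_spares: list = field(default_factory=list)
    allocated_spares: list = field(default_factory=list)


def replace_failed_drive(system, mirror, failed, new_drive):
    """Run steps (i)-(iv) in order and return a log of completed steps."""
    steps = []

    # (i) Allocate an available spare and recover data onto it.
    if not system.available_spares:
        raise RuntimeError("no spare disk drive is available")
    spare = system.available_spares.pop()
    system.allocated_spares.append(spare)
    spare.data = mirror.data
    steps.append("recovered data onto spare")

    # (ii) The technician swaps the drives; the script checks the size.
    if new_drive.size_gb < failed.size_gb:
        raise ValueError("new disk drive is smaller than the failed disk drive")
    steps.append("replaced failed drive")

    # (iii) Transfer the recovered data from the spare to the new drive.
    new_drive.data = spare.data
    steps.append("copied data to new drive")

    # (iv) Return the spare to its initially available condition.
    system.allocated_spares.remove(spare)
    spare.data = b""
    system.available_spares.append(spare)
    steps.append("returned spare")
    return steps


system = StorageSystem(available_spares=[Drive("spare-0", 100)])
mirror = Drive("mirror-0", 100, b"customer data")
failed = Drive("disk-0", 100)
new_drive = Drive("disk-0-new", 100)
steps = replace_failed_drive(system, mirror, failed, new_drive)
```

Note that the routine keeps no record of how far it got. If it raises partway through, the spare stays allocated and the only entry point is the beginning.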
The technician may arrive at the customer site and successfully recover the data of the failed disk drive onto a spare disk drive. The technician may then replace the failed disk drive with a new disk drive. If the new disk drive works properly, the technician can then transfer the recovered data to the new disk drive and then return the spare disk drive to complete the service procedure.
However, suppose that the new disk drive was itself defective, i.e., another failed disk drive. Further suppose that the technician does not possess another new disk drive to swap in place of the faulty new disk drive. In this situation, the technician typically leaves the data storage system with the replacement procedure running, and travels back to the office to retrieve another new disk drive. In the meantime, the data storage system may reboot the console device since some data storage systems are programmed to reset a component (e.g., the console device) if there has not been any activity from that component after a predetermined period of time (e.g., 30 minutes).
When the technician returns with the new disk drive, the technician finds that the console device has been rebooted and that the script for replacing a failed disk drive terminated in the middle. If the technician restarts the script, the script would operate improperly. In particular, the script would start at the beginning and require the technician to allocate a spare disk drive. Unfortunately, the technician cannot allocate the initially used spare disk drive since it is already allocated. Furthermore, if a second spare disk drive is available and the technician allocates the second spare disk drive, the data storage system would then have two allocated spare disk drives.
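The restart problem can be reproduced with a small model. It assumes the script always begins by allocating a spare and has no memory of earlier runs. `SparePool`, `ScriptAborted`, and `run_script_from_start` are hypothetical names used only for this illustration.

```python
class ScriptAborted(Exception):
    """Raised when the script stops partway, e.g. after a console reboot."""


class SparePool:
    def __init__(self, names):
        self.available = set(names)
        self.allocated = set()

    def allocate(self, name):
        if name not in self.available:
            raise RuntimeError(f"spare {name!r} is not available")
        self.available.remove(name)
        self.allocated.add(name)

    def release(self, name):
        self.allocated.remove(name)
        self.available.add(name)


def run_script_from_start(pool, spare_name, new_drive_ok):
    """Conventional script: always starts at spare allocation."""
    pool.allocate(spare_name)
    if not new_drive_ok:
        # The script terminates here and leaves the spare allocated.
        raise ScriptAborted("new disk drive is defective")
    pool.release(spare_name)


pool = SparePool(["spare-1", "spare-2"])

# First visit: the new drive is defective and the script terminates midway.
try:
    run_script_from_start(pool, "spare-1", new_drive_ok=False)
except ScriptAborted:
    pass

# Restart with the same spare: allocation fails, it is already allocated.
try:
    run_script_from_start(pool, "spare-1", new_drive_ok=True)
    restart_error = None
except RuntimeError as exc:
    restart_error = str(exc)

# Restart with a second spare: its first step leaves two spares allocated.
pool.allocate("spare-2")
allocated_after_restart = set(pool.allocated)
```

Both restart paths end in the states described above: either the allocation step is refused, or the system ends up holding two allocated spares for a single failed drive.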
At this point, a typical next step for the technician is to call the home office by telephone, and obtain technical assistance from a specialist such as someone with intimate knowledge of the disk drive replacement process. The specialist would provide detailed instructions that enable the technician to complete the disk drive replacement process by hand (i.e., without further using the script). In particular, the specialist would explain to the technician how to manually replace the second faulty disk drive with the new disk drive. The specialist would then explain how to transfer the recovered data from the spare disk drive to the new disk drive. Finally, the specialist would explain how to return the spare disk drive to an available state in order to manually complete the disk drive replacement procedure.
In some situations, the specialist may not be trained well enough to properly guide the technician through a servicing procedure. In such a situation, the technician may need to talk directly with an engineer. In these situations, the engineer is taken away from attending to other important work such as designing new products.
Additionally, the specialist o
Schreiber Moshe
Sguazzin Stefano
Shatil Arod
Beausoliel Robert
Chapin & Huang , L.L.C.
EMC Corporation
Huang, Esq. David E.
McCarthy Christopher