Methods and apparatus for providing data storage access

Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability

Reexamination Certificate

Details

C709S239000

active

06526521

ABSTRACT:

FIELD OF THE INVENTION
The invention relates generally to data storage resource management within a computer system. More particularly, the invention relates to techniques for providing access to data storage pathways which connect a cluster of nodes to a data storage system.
BACKGROUND OF THE INVENTION
A typical general purpose computer includes a processor, primary memory (e.g., semiconductor memory), secondary memory (e.g., disk memory) and one or more input/output (I/O) devices (e.g., a keyboard, printer, or network interface). Such a computer is generally suitable for situations in which occasional computer downtime does not cause serious problems (e.g., when the computer operates as a website server for advertising, or when the computer is used for playing games).
However, in some situations, even occasional downtime may cause serious problems. For example, in the banking industry, banks risk losing substantial business and goodwill if customers are unable to access their computerized accounts. Similarly, in the travel industry, companies (e.g., hotels, airlines, etc.) stand to lose significant business and will likely sustain damage to their reputations if their computerized reservation records become unavailable. Moreover, certain types of operations (e.g., satellites, nuclear power plants, and government military systems) are controlled by computerized systems, and the loss of such computerized control can be catastrophic.
To improve computer system reliability, computer manufacturers have developed fault-tolerant computer systems. Fault tolerance is a strategy for ensuring that such systems provide continued operation even when certain types of faults arise. One fault-tolerant computer system includes a host computer (or simply host or node), a fault-tolerant data storage system, and a fault-tolerant cabling system that connects the host to the data storage system. In general, the fault-tolerant data storage system includes multiple disks which store data in a manner that enables the data storage system to recover the data if a disk should fail. Various techniques for storing data on multiple disks in order to provide for reliable data recovery in the event of a disk failure are described in Patterson et al., “A Case for Redundant Arrays of Inexpensive Disks (RAID),” ACM SIGMOD Conference, Jun. 1-3, 1988, the teachings of which are incorporated herein by reference in their entirety. An example of a fault-tolerant data storage system that uses RAID is Symmetrix, which is manufactured by EMC Corporation of Hopkinton, Mass.
In general, the fault-tolerant cabling system includes multiple cables that provide multiple data storage pathways between the host and the data storage system. If a cable is cut or otherwise damaged, the remaining cables will provide continued connectivity between the host and the data storage system. Additionally, traffic through the cables can be load balanced to further enhance data exchange efficiency between the host and the data storage system. An example of such a cabling system is PowerPath, which is manufactured by EMC Corporation of Hopkinton, Mass.
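The pathway redundancy and load balancing described above can be illustrated with a minimal sketch. It assumes a simple round-robin policy that skips failed pathways; the class name `MultipathRouter`, its methods, and the pathway labels are hypothetical and are not taken from PowerPath or from this patent.

```python
from typing import Dict, List


class MultipathRouter:
    """Hypothetical round-robin selector over redundant storage pathways."""

    def __init__(self, paths: List[str]) -> None:
        if not paths:
            raise ValueError("at least one pathway is required")
        self.paths = list(paths)
        self.healthy: Dict[str, bool] = {p: True for p in paths}
        self._next = 0  # index of the next pathway to try

    def mark_failed(self, path: str) -> None:
        """Record that a pathway (e.g., a cut cable) is no longer usable."""
        self.healthy[path] = False

    def pick(self) -> str:
        """Return the next healthy pathway, skipping failed ones."""
        for _ in range(len(self.paths)):
            candidate = self.paths[self._next]
            self._next = (self._next + 1) % len(self.paths)
            if self.healthy[candidate]:
                return candidate
        raise RuntimeError("no healthy pathway remains")


router = MultipathRouter(["a", "b", "c"])
router.mark_failed("b")                       # simulate a damaged cable
chosen = [router.pick() for _ in range(4)]    # traffic alternates over a and c
```

Round-robin is only one possible balancing policy; it is used here because it makes the continued connectivity after a pathway loss easy to see.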
Another example of a fault-tolerant computer system is a cluster of computerized nodes. Each node runs one or more applications to perform a set of operations. If the cluster is configured to provide failover protection in the event of a node failure and if a first node of the cluster fails (e.g., due to a hardware failure, or a failure of all the data storage pathways connecting the first node to a data storage system), a second node that continues to run typically will detect the loss of the first node (e.g., by exceeding a timeout period in which the second node expected to receive a handshaking signal from the first node). The second node will then automatically restart and run the applications that ran on the first node. The automatic migration of an application from the first node to the second node due to the failure of the first node is called a failover operation. Since such migration will enable the second node to continue performing the operations initially performed by the first node, the cluster as a whole will remain operational even though the first node will have failed.
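The timeout-based detection and failover just described can be sketched as follows. This is an illustrative model only: the names `HeartbeatMonitor` and `fail_over`, the 5-second timeout, and the application names are assumptions, and time is passed in explicitly so the example is deterministic.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HeartbeatMonitor:
    """Tracks the last handshaking signal seen from each peer node."""
    timeout_s: float  # hypothetical timeout period, in seconds
    last_seen: Dict[str, float] = field(default_factory=dict)

    def record_heartbeat(self, node: str, now: float) -> None:
        self.last_seen[node] = now

    def timed_out(self, now: float) -> List[str]:
        """Return nodes whose last signal is older than the timeout."""
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.timeout_s)


def fail_over(apps_by_node: Dict[str, List[str]], failed: str, survivor: str) -> None:
    """Move every application from the failed node to the surviving node."""
    apps_by_node[survivor].extend(apps_by_node.pop(failed, []))


monitor = HeartbeatMonitor(timeout_s=5.0)
apps = {"node1": ["db", "web"], "node2": ["batch"]}
monitor.record_heartbeat("node1", now=0.0)
monitor.record_heartbeat("node2", now=0.0)
monitor.record_heartbeat("node2", now=8.0)    # node1 stays silent
for failed_node in monitor.timed_out(now=8.0):
    fail_over(apps, failed_node, survivor="node2")
```

After the loop, the applications that ran on `node1` are listed under `node2`, which mirrors the migration the paragraph calls a failover operation.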
SUMMARY OF THE INVENTION
In a conventional cluster of nodes, a first node may fail due to a complete failure of all the data storage pathways connecting the first node to a data storage system. If the cluster is configured to provide failover protection, a second node of the cluster typically will detect this failure by determining that the first node has not provided an expected signal within a timeout period (i.e., the first node has timed out). The second node will then automatically restart and run applications that initially ran on the first node in order for the cluster to remain operational.
There may be occasions when the first node does not completely fail but suffers degradation in its processing rate. In such a situation, the first node may continue to provide the expected signal, the loss of which would have enabled the second node to determine that the first node had failed. For example, suppose that the set of data storage pathways connecting the first node to the data storage system experiences congestion problems. Perhaps the majority of data storage pathways in the set are cut or damaged and thus become unavailable for data transfer. The first node may remain operational by routing data traffic, which would normally have passed through the unavailable data storage pathways, through the remaining intact data storage pathways. At this point, the throughput or processing speed of the first node may be significantly reduced or limited by data congestion through the remaining intact data storage pathways. Nevertheless, the first node continues to perform its original operations at a substantially reduced rate and prevents a timeout or failover condition from occurring (e.g., by providing appropriate handshaking signals to other nodes of the cluster).
In contrast to conventional mechanisms that perform failover operations when cluster nodes become unavailable but that avoid such failover operations when the nodes continue to operate at reduced rates, the invention is directed to techniques for providing access to data storage pathways that connect a cluster of nodes to a data storage system in a manner that enables a failover operation to occur from a first node to a second node when the first node suffers pathway degradation forcing the first node to operate significantly slower than previously, even when the first node retains access to the data storage system through one or more available data storage pathways. Such a failover operation from the degraded first node to the second node allows the cluster to continue performing operations at a rate that is superior to that provided by the degraded first node.
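The contrast drawn above can be made concrete with a small sketch that compares the two failover decisions. It assumes, purely for illustration, that degradation is judged by the fraction of pathways still available; the function names and the `min_fraction` parameter are hypothetical and do not appear in the source.

```python
def conventional_failover(available_paths: int) -> bool:
    """Fail over only when the node has lost every pathway (and so times out)."""
    return available_paths == 0


def degradation_failover(available_paths: int, total_paths: int,
                         min_fraction: float) -> bool:
    """Fail over once the usable fraction of pathways drops below min_fraction.

    min_fraction is an assumed tuning value, not one given in the source.
    """
    if total_paths <= 0:
        raise ValueError("total_paths must be positive")
    return available_paths / total_paths < min_fraction


# A node with 8 pathways, 6 of them cut: still reachable, but congested.
print(conventional_failover(2))           # False: the node keeps signalling
print(degradation_failover(2, 8, 0.5))    # True: 2/8 is below the 0.5 floor
```

With the same inputs, the conventional rule leaves the applications on the degraded node, while the degradation-aware rule hands them to a healthier node.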
In accordance with an embodiment of the invention, a cluster of nodes connects to a data storage system through multiple sets of data storage pathways. A cluster framework and a set of pathway resource agents operate on the cluster of nodes. In particular, a respective portion of the cluster framework and a respective pathway resource agent operate on each node. The pathway resource agents receive, from the cluster framework, instructions for controlling the pathway sets and, in response, determine which of the pathway sets are available for transferring data between the cluster of nodes and the data storage system in accordance with predetermined access conditions. The pathway resource agents then provide, to the cluster framework, operation states identifying which of the pathway sets are available for transferring data between the cluster of nodes and the data storage system in accordance with the predetermined access conditions. The cluster framework can then access the pathway sets based on the operation states.
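One way to picture the exchange between the cluster framework and the pathway resource agents is the sketch below. Everything named here is an assumption made for illustration: the `"monitor"` instruction, the `ONLINE`/`OFFLINE` states, the `collect_states` helper standing in for the framework, and the example access condition requiring at least two working pathways.

```python
from enum import Enum
from typing import Callable, Dict, List


class OperationState(Enum):
    ONLINE = "online"      # pathway set meets the access condition
    OFFLINE = "offline"    # pathway set does not meet it


class PathwayResourceAgent:
    """Hypothetical per-node agent that evaluates that node's pathway set."""

    def __init__(self, node: str, path_up: List[bool],
                 access_condition: Callable[[List[bool]], bool]) -> None:
        self.node = node
        self.path_up = path_up                  # True for each working pathway
        self.access_condition = access_condition

    def handle(self, instruction: str) -> OperationState:
        """Respond to a framework instruction with an operation state."""
        if instruction != "monitor":
            raise ValueError(f"unsupported instruction: {instruction!r}")
        if self.access_condition(self.path_up):
            return OperationState.ONLINE
        return OperationState.OFFLINE


def collect_states(agents: List[PathwayResourceAgent]) -> Dict[str, OperationState]:
    """Stand-in for the cluster framework polling every agent."""
    return {agent.node: agent.handle("monitor") for agent in agents}


def at_least_two(path_up: List[bool]) -> bool:
    """Assumed access condition: two or more pathways must be working."""
    return sum(path_up) >= 2


agents = [
    PathwayResourceAgent("node1", [True, False, False, False], at_least_two),
    PathwayResourceAgent("node2", [True, True, True, True], at_least_two),
]
states = collect_states(agents)
usable_nodes = [n for n, s in states.items() if s is OperationState.ONLINE]
```

Passing the access condition in as a callable keeps the agent independent of any particular rule, so a different predetermined condition can be supplied per node.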
Preferably, the predetermined access conditions include, for each node, a data storage pathway availability threshold. In one arrangement, the threshold for e
