Reliable distributed shared memory

Error detection/correction and fault detection/recovery – Data processing system error or fault handling – Reliability and availability

Reexamination Certificate

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

C711S163000, C707S793000

Reexamination Certificate

active

06574749

ABSTRACT:

FIELD OF THE INVENTION
The present invention relates to distributed shared memory and, in particular, a method for replicating state to result in reliable distributed shared memory.
BACKGROUND OF THE INVENTION
Distributed Shared Memory (DSM) has been an active field of research for a number of years. A variety of sophisticated approaches have been developed to allow processes on distinct systems to share a virtual memory address space, but nearly all of this work has been focussed on enabling shared memory based parallel scientific applications to be run on distributed systems. Examples of such scientific applications appear in computational fluid dynamics, biology, weather simulation and galaxy evolution. In studying such parallel systems, the principal focus is on achieving a high degree of performance.
While the domain of parallel scientific applications is important, distributed shared memory can also play a valuable role in the design of distributed applications. The rapid adoption of distributed object frameworks (e.g., CORBA and Java RMI) is leading to an increased number of distributed applications, whose functionality is partitioned into coarse-grained components which communicate through object interfaces. These distributed object frameworks are well suited for locating and invoking distributed functionality, and may transparently provide “failover” capabilities, failover being the capability of a system to detect failure of a component and to transfer operations to another functioning component. Many distributed applications, however, require the ability to share simple state (i.e., data) across distributed components, for which distributed shared memory can play a role.
To illustrate the need for the ability to share simple state across distributed components, consider a typical web-based service framework which allows new services to be readily added to the system. Some components of the framework deal with authenticating the user, establishing a session and presenting the user with a menu of services. The services are implemented as distinct distributed components, as are the various components of the framework itself In this type of system, a user-session object, encapsulating information about a user's session with the system, would represent simple state that must be available to every component and which is accessed frequently during the handling of each user request. If such a user-session object were only accessible through a remote interface, obtaining information such as the user's “customer-id” would be very expensive. Ideally, the usersession object would be replicated on nodes where it is required, and a session identifier would be used to identify each session.
As soon as data replication is considered, data consistency becomes an issue. There are a number of approaches that can be used for this purpose. As just mentioned, a typical starting point is to store shared objects on a single server and use remote object communication to access various fields. When performance is important, one will typically introduce caching mechanisms to allow local access to certain objects whenever possible. In practice, ad hoc caching and consistency schemes are used for this purpose, individually tailored for each object in question. Given the complexity and ensuing maintainability issues, such steps are not undertaken lightly.
Distributed shared memory (DSM), however, is ideally suited to this problem domain. Using “weak consistency” DSM techniques, state can be very efficiently replicated onto nodes where it is required, with very little additional software complexity. When an object is first accessed on a node, its data pages are brought onto the local processor and subsequent accesses occur at memory access speeds. An underlying DSM layer maintains consistency among the various copies.
Weak consistency refers to the way in which shared memory that is replicated on different nodes is kept consistent. With weak consistency, accesses to synchronization variables are sequentially consistent, no access to a synchronization variable is allowed to be performed until all previous writes have completed everywhere and no data access (read or write) is allowed to be performed until all previous accesses to synchronization variables have been performed (see M. Dubois, C. Scheurich, and F. Briggs, “Memory Access Buffering in Multiprocessors,” International Symposia on Computer Architecture 1986, pp. 434-442., incorporated herein by reference).
Since existing DSM research has been focussed on parallel scientific computation, there are a number of issues that have not been addressed. First, existing DSM systems typically assume that all the nodes and processes involved in a computation are known in advance, which is not true of most distributed applications. Second, many existing DSM systems are not designed to tolerate failures, either at the node level or the application level, which will almost certainly occur in any long-running distributed application. Third, distributed applications will often have several processes running on a given node, which should be taken into account in the design of the DSM system. Finally, distributed applications have a much greater need for general-purpose memory allocation and reclamation facilities, when one is not dealing with fixed-sized multidimensional arrays allocated during application initialisation (which is typical of scientific applications).
In addition, DSM systems can be augmented to be fault tolerant by ensuring that all data is replicated to a parameterizable degree at all times. Although doing so leads to some level of overhead (on write operations), this cost may be warranted for some types of data and may still provide much better performance than storing data in secondary storage (via a database). By using a fault-tolerant DSM system for all in-memory critical data, a distributed application can easily be made to be highly available. A highly available system is one that continues to function in the presence of faults. However, unlike most fault-tolerant systems, failures in a highly available system are not transparent to clients.
Traditional in-memory data-replication schemes include primary site replication and active replica replication. In primary site replication, read and write requests for data are made to a primary (or master) site, which is responsible for ensuring that all replicas are kept consistent. If the primary fails, one of the replicas is chosen as the primary site. In active replica replication, write requests for data are made to all replica sites, using an algorithm that ensures that all writes are performed in the same order on all hosts. Read requests can be made to any replica.
Adapting a DSM system for fault tolerance is quite different than traditional in-memory data replication schemes in that the set of nodes replicating data are the ones that are actively using it. In the best case, where a single node is the principal accessor for an object and where the majority of memory operations are read operations (a read-mostly object) performance using distributed lock leasing algorithms approaches that of a local object. A distributed lock leasing algorithm is an algorithm that allows one node among a set of nodes to acquire a “lock” on a unit of shared memory for some period of time. A lock is an example of a synchronization variable since it synchronizes the modification of a unit of memory, i.e. ensures the unit is only modified by one processor at a time. If a node should fail while holding a lock, the lock is reclaimed and granted to some other node in such a way that all correctly functioning nodes agree as to the state of the lock. Read-mostly objects that are actively shared amongst several nodes may be more costly as more lock requests will involve remote communications. Most expensive may be highly shared objects that are frequently modified.
Some alternative approaches to introducing fault tolerance have been to make the DSM “recoverable”, that is, allowing the system to be recov

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

Reliable distributed shared memory does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with Reliable distributed shared memory, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Reliable distributed shared memory will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-3149000

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.