Electrical computers and digital processing systems: memory – Storage accessing and control – Control technique
Reexamination Certificate
1998-06-15
2001-02-06
Yoo, Do Hyun (Department: 2759)
Electrical computers and digital processing systems: memory
Storage accessing and control
Control technique
C711S161000, C711S162000, C711S147000, C707S793000
Reexamination Certificate
active
06185663
ABSTRACT:
BACKGROUND OF THE INVENTION
A computer system configuration in which a plurality of computers all access a collection of shared memory (i.e., disks) is called a “shared disk system”. Restated, in a shared disk environment, multiple nodes (computers) of a cluster concurrently write to one or more shared disks. A system that exploits such a shared environment, such that multiple nodes access common data, is called a “data sharing” system. A distinguishing feature of data sharing and shared disk systems from other systems is that a data item can reside in the caches of multiple computers (nodes) and can be updateable from these nodes. That is, there is no one computer used as a server or coordinator; all system computers are peers in their access to the shared data.
FIG. 1
illustrates a typical data sharing environment of a computer system
11
. A memory storage
14
comprises persistent storage (p-store) and thus is the system storage whose contents persist when part or all of the computer system
11
crashes. Traditionally, this storage
14
has been magnetic disk. A “page” of persistent storage is an automically updateable entity. An integral number of pages of persistent storage forms a block
18
. Blocks
18
are the recoverable objects/entities of the computer system
11
. Each block
18
is identified by a unique block identifier (BID).
Computer nodes
10
A,
10
B, and
10
C of computer system
11
access memory storage
14
via a storage interconnect subsystem
19
. Each node
10
A, B, C supports one respective local cache
12
A,
12
B,
12
C. Each cache
12
is a simple shared memory and the corresponding computer
ode
10
may have multiple processors. Each cache
12
A, B, C is formed of volatile storage for holding certain data blocks
18
. A node
10
can operate only on blocks
18
which have been read from memory storage
14
into respective local cache
12
. Because the local cache
12
is formed of volatile storage, the contents (i.e., locally stored data blocks
18
) of the cache
12
may be lost during a system
11
crash.
Each node
10
A,
10
B,
10
C has a respective, dedicated transaction log
16
A,
16
B,
16
C. Each transaction log
16
A, B, C is formed of a series of records, one record for each operation (or action) performed by the respective node
10
on corresponding cache data.
In the case of prior art data base systems, each node
10
is a cooperating computer running software to implement a database system. To that end, a node
10
reads a desired data block
18
from memory storage
14
into the node's local cache
12
. The node
10
writes into a respective transaction log
16
logical changes to (i.e., operations on) data blocks
18
stored in local cache
12
. Each transaction log
16
is a sequential file used to record information that permits an operation of a transaction to be performed again (redone), should the system
11
crash prior to committed data being written to memory storage
14
. Recovery involving the transaction log
16
assures the durability of transaction effects in memory storage
14
.
That is, after normal operation of the nodes
10
writing entries to respective transaction logs
16
, computer system
11
writes the data back to memory storage
14
. A distributive lock manager coordinates who (which node
10
) updates a data block of
18
for the purposes of serializing operations on the data blocks
18
. If the system
11
crashes during the time of data being updated and written back to memory storage
14
, a recovery mode of operation must be undertaken.
The transaction logs provide recovery of state of the database after a system crash or failure. The illustrated data sharing system
11
utilizes one transaction log
16
for each node/computer
10
. However, single log recovery approaches also exist. In the case of a single/central transaction log, a synchronization mechanism is required to enable the multiple nodes to write to the single transaction log during operation of the nodes. An example of a single log system is Microsoft's NTFS.
However, better scalability and performance is achieved using a multiple log approach as disclosed by David B. Lomet in “Recovery for Shared Disk Systems Using Multiple Redo Logs,” Digital Equipment Corporation Cambridge Research Lab, Report CRL 90/4, Oct. 1, 1990. In Lomet, a separate transaction log
16
for each node
10
is used. This avoids the synchronization issues of the single log approaches. Lomet describes the technique of storing so called “before state identifiers” (BSI's) in each transaction log record. This means that each transaction log record contains the precise identity of the data block state that is seen before the logged action is performed.
The process of recovery makes all blocks
18
“current” (i.e., all updates to the blocks have been applied). Transaction log records are applied to a block
18
until there are no more transaction log records for that block
18
. At this point the block
18
is said to be current. Further, the block
18
is considered to be “One-Log” because all transaction log records that need to be applied to the block
18
are held in a single transaction log
16
. More specifically, the block is required to be One-log and a certain protocol is used to achieve this. Namely, the block is written to disk when ownership moves from one node to another.
A state identifier is used in Lomet to determine whether or not to apply a particular transaction log record. A state identifier is a definitive description of the block
18
contents and is much smaller than the complete state. A state identifier is defined by the system
11
storing a defining state identifier (DSI) in block
18
. The DSI denotes the block state of the block in which the DSI is included.
During recovery, the system
11
knows whether a transaction log record applies to a block state by testing for equality between the transaction log record BSI and the defining state identifier (DSI) in the block. If the transaction log record BSI is determined to equal the block DSI, then the operation recorded in the transaction log record is executed on the data to transform the subject data block of the transaction record from its present state to a successor state.
As such, Lomet's scheme avoids concurrency control and coordination required for N-log recovery, by enforcing one-log redo and maintaining the independent redo recovery assumption. That assumption provides that the state of a data block
18
is able to be transformed to a successor state during redo recovery based solely on the state of the block at the time the original operation was applied and the contents of the transaction log record that describes the operation.
Using BSI's in this manner means that it is possible to identify which transaction log record to apply to a block
18
in some state, independently of transaction log
16
ordering or the number of transaction logs
16
. One does not rely on the relative ordering of records between transaction logs
16
; the test of determining whether or not to apply a transaction log record to a block
18
is inherent in their (the transaction log records) states. With the uniqueness of state identifiers and the one-log condition, during recovery, only one transaction log will contain a transaction log record with a BSI that matches a block's DSI.
Of further note, in a typical database system, the content of every block
18
is directly managed by the database system
11
. Therefore, it is practical to include a state identifier in the definition of every block. Each block may be initialized in memory storage
14
when it is first allocated to the database. Initialization is an expensive operation, but it need not occur during normal operation of the database. Alternatively, the initial random state found in the data block
18
could be used as its initial state identifier. This requires that the block be read from memory storage
14
before its first use.
When a free (not allocated for use) data block
18
is reused, the previous contents are
Compaq Computer Corporation
Hamilton Brook Smith & Reynolds P.C.
Moazzami Nasser
Yoo Do Hyun
LandOfFree
Computer method and apparatus for file system block... does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Computer method and apparatus for file system block..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Computer method and apparatus for file system block... will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-2610368