Electrical computers and digital processing systems: multicomput – Computer-to-computer data routing – Least weight routing
Reexamination Certificate
1996-07-12
2001-06-12
Banankhah, Majid A. (Department: 2151)
active
06247038
ABSTRACT:
FIELD OF THE INVENTION
The present invention relates to synchronization of transactions in data processing systems. More particularly it relates to the synchronization of a transaction in a data processing system including a plurality of agents participating in the transaction and one coordinator for coordinating said transaction, the agents including at least a middleman coordinating a set of at least one of the agents, including the steps of sending a vote indicating the availability or non-availability to commit from each of the agents to the coordinator, and determining a commit or backout decision by the coordinator when all the votes are received.
BACKGROUND OF THE INVENTION
In data processing systems, access and updates to system resources are typically carried out by the execution of discrete transactions (or units of work). A transaction is a sequence of coordinated operations on system resources such that either all of the changes take effect or none of them does. These operations are typically changes made to data held in storage in the transaction processing system; system resources include databases, data tables, files, data records and so on. This characteristic of a transaction being accomplished as a whole or not at all is also known as atomicity.
In this way, resources are prevented from becoming inconsistent with one another. If one of the set of update operations fails, then the others must also not take effect. A unit of work thus transforms a consistent state of resources into another consistent state, without necessarily preserving consistency at all intermediate points.
The atomic nature of transactions is maintained by means of a transaction synchronization procedure commonly called the commit procedure. Logical points of consistency at which resource changes are synchronized within transaction execution are called commit points or syncpoints. An application ends a unit of work by declaring a syncpoint, or by the application terminating.
Atomicity of a transaction is achieved by resource updates made within the transaction being held in-doubt (uncommitted) until a syncpoint is declared at completion of the transaction. If the transaction succeeds, the results of the transaction are made permanent (committed); if the transaction fails, all effects of the unsuccessful transaction are removed (backed out). That is, the resource updates are made permanent and visible to applications other than the one which performed the updates only on successful completion. For the duration of each unit of work, all updated resources must therefore be locked to prevent further update access. Conversely, when a transaction backs out (or rolls back), the resources are restored to the consistent state which existed before the transaction began.
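The commit and backout behaviour described above can be illustrated with a minimal sketch. The class name UnitOfWork, its methods and the sample resource keys are hypothetical and chosen only for illustration; the sketch assumes a single in-memory dictionary standing in for the system resources.

```python
class UnitOfWork:
    """Holds updates uncommitted until a syncpoint is declared (illustrative only)."""

    def __init__(self, resources):
        self.resources = resources  # committed state, visible to other applications
        self.pending = {}           # uncommitted updates made by this unit of work
        self.locked = set()         # keys locked against further update access

    def update(self, key, value):
        # The update is held as pending; the committed state is untouched.
        self.locked.add(key)
        self.pending[key] = value

    def read(self, key):
        # The unit of work sees its own uncommitted updates.
        return self.pending.get(key, self.resources.get(key))

    def syncpoint(self):
        # Commit: make all pending updates permanent and visible together.
        self.resources.update(self.pending)
        self._end()

    def backout(self):
        # Backout: discard pending updates, leaving the prior consistent state.
        self._end()

    def _end(self):
        self.pending.clear()
        self.locked.clear()


records = {"A": 100, "B": 50}
uow = UnitOfWork(records)
uow.update("A", uow.read("A") - 30)
uow.update("B", uow.read("B") + 30)
visible_before_syncpoint = dict(records)  # other applications still see the old values
uow.syncpoint()
```

Until syncpoint() is called, the shared dictionary is unchanged; calling backout() instead would leave it exactly as it was before the unit of work began.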
There are a number of different transaction processing systems commercially available; an example of an on-line transaction processing system is the CICS system developed by International Business Machines Corporation (IBM is a registered trademark and CICS is a trademark of International Business Machines Corporation).
In a transaction data processing system which includes either a single node where transaction operations are executed or which permits such operations to be executed at only one node during any transaction, atomicity is enforced by a single-phase synchronization operation. In this regard, when the transaction is completed, the node, in a single phase, either commits to make the changes permanent or backs out.
In distributed systems encompassing a multiplicity of nodes, a transaction may cause changes to be made to more than one of such nodes. In such a system, atomicity can be guaranteed only if all of the nodes involved in the transaction agree on its outcome. A simple example is a financial application to carry out a funds transfer from one account to another account in a different bank, thus involving two basic operations to critical resources: the debit of one account and the credit of the other. It is important to ensure that either both of these operations take effect or neither does.
Distributed systems typically use a transaction synchronization procedure called the two-phase commit (2PC) protocol to guarantee atomicity. In this regard, assume that a transaction ends successfully at an execution node and that all node resource managers (or agents) are requested to commit operations involved in the transaction. In the first phase of the protocol (prepare phase), all involved agents are requested to prepare to commit. In response, the agents individually decide, based upon local conditions, whether to commit or back out their operations. The decisions are communicated to a synchronization location, called the coordinator, where the votes are counted. In the second phase (commit phase), if all agents vote to commit, a request to commit is issued, in response to which all of the agents commit their operations. On the other hand, if any agent votes to back out its operation, all agents are instructed to back out their operations. In a large system with a high volume of transactions, the two-phase commit process may arrange the agents in a tree-like manner in which one of a subset of agents acts as a middleman to coordinate the votes of the subset and send a combined vote to the main coordinator.
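The two phases and the middleman arrangement described above can be sketched as follows. This is a simplified in-process illustration, not the claimed procedure: the names Vote, Agent, Middleman and two_phase_commit are hypothetical, messages are modelled as direct method calls, and failures and logging are ignored.

```python
from enum import Enum


class Vote(Enum):
    COMMIT = "commit"
    BACKOUT = "backout"


class Agent:
    """A resource manager that votes based on local conditions."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.outcome = None

    def prepare(self):
        # Phase 1: decide locally whether to commit or back out.
        return Vote.COMMIT if self.can_commit else Vote.BACKOUT

    def decide(self, decision):
        # Phase 2: apply the coordinator's decision.
        self.outcome = decision


class Middleman(Agent):
    """An agent that also coordinates a subset of agents and sends a combined vote."""

    def __init__(self, name, children, can_commit=True):
        super().__init__(name, can_commit)
        self.children = children

    def prepare(self):
        votes = [super().prepare()] + [child.prepare() for child in self.children]
        return Vote.COMMIT if all(v is Vote.COMMIT for v in votes) else Vote.BACKOUT

    def decide(self, decision):
        super().decide(decision)
        for child in self.children:
            child.decide(decision)


def two_phase_commit(agents):
    """Coordinator: collect every vote, then broadcast a single decision."""
    votes = [agent.prepare() for agent in agents]
    decision = Vote.COMMIT if all(v is Vote.COMMIT for v in votes) else Vote.BACKOUT
    for agent in agents:
        agent.decide(decision)
    return decision


leaf_a, leaf_b = Agent("a"), Agent("b", can_commit=False)
tree = [Agent("c"), Middleman("m", [leaf_a, leaf_b])]
result = two_phase_commit(tree)
```

In this example one leaf under the middleman votes to back out, so the combined vote is a backout and every agent in the tree, including those that voted to commit, is instructed to back out.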
Distributed systems are organized in order to be largely recoverable from system failures, either communication failures or node failures. A communication failure and a failure in a remote node generally manifest themselves by the cessation of messages to one or more nodes. Each node affected by the failure can detect it by various mechanisms, including a timer in the node which detects when a unit of work has been active for longer than a preset maximum time. A node failure is typically due to a software failure requiring restarting of the node or a deadlock involving pre-emption of the transaction running on the node.
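The timer mechanism mentioned above might look like the following sketch. The class name UnitOfWorkTimer and the injectable clock parameter are assumptions made for illustration; a real node would tie such a timer into its communication layer.

```python
import time


class UnitOfWorkTimer:
    """Flags a unit of work that has been active longer than a preset maximum."""

    def __init__(self, max_seconds, clock=time.monotonic):
        self.max_seconds = max_seconds
        self.clock = clock          # injectable so the timer can be exercised without waiting
        self.started = clock()

    def expired(self):
        # True once the unit of work has exceeded its allowed duration.
        return self.clock() - self.started > self.max_seconds


# Usage with a hypothetical fake clock standing in for elapsed time.
fake_now = [0.0]
timer = UnitOfWorkTimer(5.0, clock=lambda: fake_now[0])
within_limit = timer.expired()
fake_now[0] = 6.0
past_limit = timer.expired()
```

An expired timer only indicates that messages have stopped arriving in time; it cannot by itself distinguish a communication failure from a failure of the remote node.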
System failures are managed by a recovery procedure requiring resynchronization of the nodes involved in the unit of work. Since a node failure normally results in the loss of information in volatile storage, any node that becomes involved in a unit of work must write state changes (checkpoints) to non-volatile storage synchronously with the transmission of messages during the two-phase commit protocol. These checkpoint data (or log messages) are written to a stable storage medium as the protocol proceeds to allow the same protocol to be restarted from a consistent state in the case of a failure of the node. This is known as resynchronization.
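A minimal sketch of such checkpoint logging is given below. The class SyncpointLog, the function units_needing_resync, the JSON line format and the state names "prepared", "committed" and "backed_out" are all hypothetical; the sketch assumes a local append-only file as the non-volatile storage.

```python
import json
import os
import tempfile


class SyncpointLog:
    """Append-only log of unit-of-work state changes, forced to stable storage."""

    def __init__(self, path):
        self.path = path

    def checkpoint(self, uow_id, state):
        # Write the state change and force it to disk before the caller
        # sends the corresponding protocol message.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps({"uow": uow_id, "state": state}) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def last_states(self):
        # Replay the log after a restart; later records override earlier ones.
        states = {}
        if not os.path.exists(self.path):
            return states
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                states[record["uow"]] = record["state"]
        return states


def units_needing_resync(log):
    """Units of work that voted but never recorded an outcome."""
    return [uow for uow, state in log.last_states().items() if state == "prepared"]


log_path = os.path.join(tempfile.mkdtemp(), "syncpoint.log")
log = SyncpointLog(log_path)
log.checkpoint("uow-1", "prepared")
log.checkpoint("uow-1", "committed")
log.checkpoint("uow-2", "prepared")   # node fails before learning the outcome
pending = units_needing_resync(SyncpointLog(log_path))  # simulated restart
```

After the simulated restart, only the unit of work whose last logged state is "prepared" needs to be resynchronized with its partners.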
U.S. Pat. No. 5,311,773 describes how a commit procedure can be resynchronized asynchronously after a failure while allowing an initiating application to proceed with other tasks. It does not, however, address the problem of interruption of communication to multiple partner nodes involved in a distributed unit of work.
The IBM Systems Network Architecture (IBM SNA) LU 6.2 syncpoint architecture, developed by International Business Machines Corporation, is known to coordinate commits between two or more protected resources. The LU 6.2 architecture supports a syncpoint manager (SPM) which is responsible for resource coordination, syncpoint logging and recovery. A description of the communication protocol used in this architecture is found in the book “SNA Peer Protocols for LU6.2” (ref. SC31-6868-1, IBM Corporation).
A problem with known protocols for two-phase commit across networks is that they do not cater adequately for the case where contact with the coordinator of the unit of work is lost. In such cases, it is not possible to immediately tell other partners of the distributed unit of work what the outcome is. The decision is only known later when contact is made with the coordinator.
If contact is lost, partners can be kept waiting indefinitely until contact is made again. Each of the partners may hold resource locks and keep application code and end users waiting for a long time. Operator action is then required to release locks, applications and end user screens.
A known solutio
Banks Timothy William
Hunter Ian
Lupton Peter James
Normington Glyn
Zimmer Dennis Jack
Banankhah Majid A.
Caldwell P. G.
International Business Machines Corporation
Ray-yarlette Jeanine S.