Data gather/scatter machine
Patent number: 6,820,264 (Reexamination Certificate; status: active)
Filed: 2000-03-02
Issued: 2004-11-16
Classification: Electrical computers and digital processing systems: interprogram communication or interprocess communication – Miscellaneous (C707S793000)
FIELD OF THE INVENTION
The invention relates to multiprocessor computers and more particularly to a message passing interface (MPI) application programming interface (API) for passing messages between multiple tasks or processes.
TRADEMARKS
S/390 and IBM are registered trademarks of International Business Machines Corporation, Armonk, N.Y., U.S.A., and Lotus is a registered trademark of Lotus Development Corporation, an independent subsidiary of International Business Machines Corporation, Armonk, N.Y. Other names may be registered trademarks or product names of International Business Machines Corporation or other companies.
BACKGROUND
Message Passing Interface (MPI) defines a standard application programming interface (API) for using several processes at one time to solve a single large problem, called a “job,” on a multiprocessor and often multi-node computer (commonly one process per node). Each job can include multiple processes; a process is also commonly referred to as a task. Each process or task computes independently except when it needs to exchange data with another task. The program passes the data from one task to another as a “message.” Examples of multiprocessor computers are the IBM RISC System/6000 SP available from IBM Corporation, Armonk, N.Y., and supercomputers available from Cray, Silicon Graphics, Hewlett Packard, Thinking Machines, and the like.
Specifically, a programmer can use an explicit MPI_SEND to identify what data from the memory of the source task is to be sent as a given message. The programmer can also use an explicit MPI_RECV at the destination task to identify where the data is to be placed in the receiver memory.
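As a minimal illustration (not taken from the patent: the buffer length, tag, and ranks below are assumptions), an explicit send/receive pair in C looks like this:

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, i, data[100];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Source task: identify which memory is to be sent as the message. */
        for (i = 0; i < 100; i++)
            data[i] = i;
        MPI_Send(data, 100, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Destination task: identify where the incoming data is placed. */
        MPI_Recv(data, 100, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}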
In conventional message passing, a send or receive call would identify a memory address and byte count. This is restrictive because it is common for the content which logically constitutes a message to be discontiguous in memory.
The conventional approach is, however, a neat fit for the lower-level transport model, which treats the data to be moved between tasks as byte streams.
The conventional solutions have been to send a distinct message for each contiguous unit, or to allocate a scratch buffer and copy or pack the discontiguous data into the scratch buffer before sending. These techniques add programmer burden and execution time overhead.
For example, in a 10×10 matrix of integers, M, stored row major, a row is 10 contiguous integers but a column is every 10th integer. The programmer with a row to send could exploit the fact that the data was already contiguous and could use a simple send. Conventionally, the programmer with a column to send would need to use one of the more awkward techniques. Similar considerations apply to a receive where the eventual destination of the data may not be contiguous.
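For illustration only (the column index, destination rank, and tag are assumptions, and the MPI program skeleton of the earlier sketch is reused), the conventional pack-and-send workaround for a column of M might look like:

    /* Conventional workaround: copy the discontiguous column into a
     * scratch buffer, then send the contiguous copy.  This costs extra
     * code and an extra pass over the data. */
    int M[10][10];                 /* 10 x 10 matrix, stored row major   */
    int scratch[10];
    int col = 3;                   /* which column to send (assumed)     */
    int i;

    for (i = 0; i < 10; i++)
        scratch[i] = M[i][col];    /* every 10th integer of M            */

    MPI_Send(scratch, 10, MPI_INT, /*dest=*/1, /*tag=*/0, MPI_COMM_WORLD);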
To simplify the description which follows, the focus will be on sending messages, although the same considerations apply to receiving them. To send a message, data is gathered from memory and fed to the transport layer at the rate the transport layer is able to accept it. Bytes of a message are forwarded in chunks, and the transport layer dictates the size of each chunk. When the transport layer is ready to accept N bytes, N bytes are copied from the proper memory locations into the transport (pipe) buffer. The data gather logic delivers a specific number of bytes at each activation and then, at the next activation, picks up where it left off to deliver more bytes.
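The gather logic just described can be pictured as a cursor that survives between activations. The sketch below is illustrative only: the type and function names are assumptions, it handles only the contiguous case, and the machinery discussed later extends the same idea to discontiguous layouts.

#include <stddef.h>
#include <string.h>

/* State preserved between activations of the data gather logic. */
typedef struct {
    const char *base;    /* start of the message data in user memory    */
    size_t      total;   /* total number of bytes in the message        */
    size_t      done;    /* bytes already handed to the transport       */
} gather_state;

/* Called whenever the transport layer is ready to accept up to n more
 * bytes; copies them into the transport (pipe) buffer and remembers
 * where it stopped.  Returns the number of bytes actually delivered. */
size_t gather_next(gather_state *gs, char *pipe_buf, size_t n)
{
    size_t remaining = gs->total - gs->done;
    size_t chunk = (n < remaining) ? n : remaining;

    memcpy(pipe_buf, gs->base + gs->done, chunk);
    gs->done += chunk;             /* next activation picks up here      */
    return chunk;
}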
Receiving a message is a mirror image of the sending of one. Some number of bytes becomes available from a pipe and must be distributed. It would be apparent to those skilled in the art that the concepts involved in sending and receiving are so closely related that to understand one is to understand the other.
The MPI standard addresses the problem of dealing with discontiguous memory data by defining a set of calls which enable the programmer to describe any possible layout of data in memory. It then falls to the MPI implementation to gather data and feed it to the transport layer, or to receive incoming data and scatter it to user task memory. The description is called an MPI_Datatype and can be visualized as a template with a stride and one or more tuples, each tuple representing a data unit and its offset within the template. For the 10×10 integer matrix, M, mentioned above, assume it is desirable to send a single message taking the first and sixth integer of each row. In conventional message passing, a 20-integer scratch buffer could be allocated and a loop could be coded to copy these 20 integers from their locations in M to the scratch buffer; then 80 bytes could be sent from the scratch buffer. In MPI, an MPI_Datatype, called for example “newtype,” can be defined to indicate the first and sixth integer of each row: {(0,integer) (20,integer) stride=40}, followed by a call to MPI_SEND(M,10,newtype, . . . ). The MPI implementation interprets the template 10 times to gather and transmit the 80 bytes.
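One way to build an equivalent datatype with standard MPI constructor calls is sketched below, reusing the matrix M and program skeleton from the earlier sketches; the particular constructors, the 4-byte integer size, and the destination rank and tag are assumptions for the illustration and are not prescribed by the patent.

    /* Build "newtype" = {(0,integer) (20,integer) stride=40} and send
     * ten repetitions of it from the 10 x 10 integer matrix M.          */
    MPI_Datatype pair, newtype;

    /* Elements 0 and 5 of a row: two blocks of one int, five ints apart. */
    MPI_Type_vector(2, 1, 5, MPI_INT, &pair);

    /* Stretch the extent to a full row (10 ints = 40 bytes) so that a
     * send count of 10 steps from row to row, i.e. stride = 40.         */
    MPI_Type_create_resized(pair, 0, 10 * sizeof(int), &newtype);
    MPI_Type_commit(&newtype);

    /* Ten interpretations of the template gather and transmit 80 bytes. */
    MPI_Send(M, 10, newtype, /*dest=*/1, /*tag=*/0, MPI_COMM_WORLD);

    MPI_Type_free(&pair);
    MPI_Type_free(&newtype);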
MPI offers a set of predefined datatypes and a set of constructor calls which allow user-defined datatypes to be built from the predefined types plus any previously defined user types. Since any new datatype is defined in terms of previously defined types, the natural representation to flow from a sequence of type constructor calls is a tree with predefined types as leaves and user-defined types as internal nodes. MPI implementations use such trees to record the datatypes created in a user's code. To gather the data for an MPI_SEND of 10 units of “newtype,” an MPI implementation would traverse the tree representing “newtype” 10 times. Most implementations of MPI allocate a temporary buffer large enough for the entire message, do the entire gather at one time, and then send from the temporary buffer. A different implementation uses an approach which gathers in increments and which preserves the state of the gather operation from step to step. Like the former implementations, this incremental implementation has depended on traversing the tree as many times as needed.
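One plausible shape for such a tree encoding, and the per-repetition traversal it implies, is sketched below; the structures and names are illustrative assumptions, not the patent's or any particular MPI implementation's internal representation.

#include <stddef.h>
#include <string.h>

/* Illustrative tree encoding of a datatype: leaves stand for predefined
 * types, internal nodes list (offset, child) pairs produced by the
 * constructor calls.  Not the patent's data structures. */
typedef struct type_node {
    int                 is_leaf;    /* predefined type?                  */
    size_t              size;       /* bytes of data, for a leaf         */
    int                 nchildren;
    size_t             *offsets;    /* byte offset of each child         */
    struct type_node  **children;
} type_node;

/* Gather one copy of the type rooted at t, reading from base and
 * appending to out; an MPI_SEND with count N walks this tree N times,
 * advancing base by the type's extent between repetitions. */
char *gather_one(const type_node *t, const char *base, char *out)
{
    int i;

    if (t->is_leaf) {
        memcpy(out, base, t->size);
        return out + t->size;
    }
    for (i = 0; i < t->nchildren; i++)
        out = gather_one(t->children[i], base + t->offsets[i], out);
    return out;
}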
The current approach has several limitations. The MPI standard (MPI-1) was initially defined in a way which allowed all MPI_Datatype information to be local. If two tasks wish to communicate, each task makes its own type constructor calls and each task produces its own tree-encoded description of a datatype. The sending task would “gather” based on the MPI_SEND type description, and the data would flow to the destination, which would “scatter” according to the MPI_RECV type description. The programmer constructs datatypes for the MPI_SEND and matching MPI_RECV which are compatible, but neither task has any access to the description used at the other end. Describing a datatype with a tree is adequate (though not ideal) when the description can remain local.
With MPI-2, the MPI standard was extended with features which depend on it being possible for a datatype constructed at one task to be sent to another task and interpreted there.
One of the extended features is one-sided communication, in which an origin task uses a call such as MPI_PUT which specifies a first datatype to be used for the local “gather” as well as a second datatype to be used for the “scatter” at the target task. Both datatypes used in the MPI_PUT call are local to the task which calls MPI_PUT, but the semantic of the call is as if the origin did an MPI_SEND with the first datatype and the target did a matching MPI_RECV with the second. To produce this semantic, the type description which is available at the origin is packaged and sent to the target in a form which the target task can interpret. One-sided communication includes an additional complication in the MPI_ACCUMULATE function. An accumulate is like an MPI_PUT except that, at the target, each unit of data which arrives (e.g., an integer, a 4-byte floating point value, or an 8-byte floating point value) can be combined with the data already present by some reduction function (such as add, multiply, or bitwise AND).
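A sketch of the one-sided calls involved is given below. The window size, counts, target rank, and use of MPI_SUM are assumptions for the illustration, both datatypes are simply MPI_DOUBLE here although in general the origin and target datatypes differ, and the rank variable comes from the program skeleton sketched earlier.

    /* One-sided sketch: both datatype arguments are supplied by the
     * origin task, yet the target datatype must be interpreted at the
     * target, which is why its description has to be shipped there.     */
    MPI_Win win;
    double local[100] = {0}, window_mem[100] = {0};

    /* Collective: every task exposes a window over some of its memory.  */
    MPI_Win_create(window_mem, 100 * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0) {   /* the origin task */
        /* Semantically: an MPI_SEND with the first datatype at the origin
         * matched by an MPI_RECV with the second datatype at the target. */
        MPI_Put(local, 100, MPI_DOUBLE,
                /*target_rank=*/1, /*target_disp=*/0, 100, MPI_DOUBLE, win);
    }
    MPI_Win_fence(0, win);   /* separate epochs so the put and the
                              * accumulate do not conflict at the target */
    if (rank == 0) {
        /* Like MPI_PUT, but each arriving unit is combined with the data
         * already present at the target by a reduction such as MPI_SUM.  */
        MPI_Accumulate(local, 100, MPI_DOUBLE,
                       /*target_rank=*/1, /*target_disp=*/0, 100, MPI_DOUBLE,
                       MPI_SUM, win);
    }
    MPI_Win_fence(0, win);
    MPI_Win_free(&win);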
Another
Inventors: Guy Bashkansky, Sergei Shmulyian, Ron Sivan, Richard R. Treumann
Assignee: International Business Machines Corporation
Examiners: Sue Lan (Department 2126); LeChi Truong
Attorney, Agent, or Firm: Lawrence D. Cutter; Lane Aitken & McCann