Electrical computers and digital processing systems: virtual machine – Task management or control
Reexamination Certificate
1999-05-05
2004-07-13
An, Meng-Al T. (Department: 2127)
Electrical computers and digital processing systems: virtual machine
Task management or control
C718S102000, C718S104000, C718S107000, C709S202000, C709S226000, C712S001000, C712S028000
Reexamination Certificate
active
06763519
ABSTRACT:
BACKGROUND OF THE INVENTION
Over the past five years, many researchers have developed new software systems for improved communication in multiprocessor systems. The goal has been to achieve low latency and high bandwidth. The approach has been to remove much of the redundancy in protocol stacks, and eliminate buffer copying by integration of kernel-space with user-space buffer management. For details, see for example [1,2,3,14,16,20,21]. This work has demonstrated that impressive improvements in latency can be achieved when sending a single message or for sparse communication patterns. However, for many of the dense communication patterns which arise in practice, e.g., when large coarse-grain parallel applications are run on moderately parallel architectures, this locally controlled approach to communications management has not provided an effective and efficient solution.
Inefficient, locally controlled communication is a legacy of the dominance of synchronous message passing as a parallel programming style throughout the 1980s and early 1990s. Many message passing programs are inherently inefficient if implemented directly, i.e., exactly as the programmer specifies. This is because no single process in a message passing program can efficiently coordinate its actions with all the other processes that are simultaneously computing, communicating and synchronizing within the machine. A process in a message passing program cannot normally combine its messages with others bound for the same destination, nor can it tell whether it is a good time to send because the network is lightly loaded, or a bad time because the network is heavily congested.
Bulk synchronous parallelism (BSP), described in [17,19], which are incorporated herein by reference in their entirety, is an alternative programming style in which a program is structured as a sequential composition of parallel “supersteps”. BSP programs can achieve major communication performance improvements over equivalent message passing programs. These improvements are based on the use of global control techniques such as:
Batch communication. Efficient oblivious routing of general communications as a series of structured communication patterns such as total exchange [7,8,9].
Repackaging. All messages for a given destination are combined. The combined message is packaged and sent in the most efficient form for the particular network [11].
Destination scheduling. The order in which messages are sent is globally managed and controlled in order to avoid contention at the destination nodes. The global control is non-oblivious, i.e., the scheduling strategy is dynamically determined by the system, taking account of the particular structure of the communication pattern to be realized. This information is determined from data sent during the BSP barrier synchronization [11,17].
Pacing. The injection of new messages into the network is globally controlled to ensure that the total applied load is maximized, but that it does not exceed the level at which network throughput will start to decrease. As with destination scheduling, a non-oblivious method is used to minimize contention [6].
Global implicit acknowledgments. The structure of a BSP computation allows an implementation to avoid sending unnecessary acknowledgments in certain cases. Suppose that the packets from one superstep are colored red, and those of the next superstep are colored green. If processor i has sent red packets to processor j which are currently unacknowledged, and processor i receives a green packet from processor k, then it knows that all processors must have (logically) passed through a global barrier synchronization, and it can regard all red packets as implicitly acknowledged [4].
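The red/green coloring rule for implicit acknowledgments can be sketched in code. This is an illustrative sketch only, not the patent's implementation; the `Endpoint` class and its method names are hypothetical.

```python
RED, GREEN = 0, 1  # packets are colored by superstep parity

class Endpoint:
    def __init__(self):
        self.superstep = 0
        self.unacked = []  # (color, packet) pairs awaiting acknowledgment

    def send(self, packet):
        # Outgoing packets take the color of the current superstep.
        color = self.superstep % 2
        self.unacked.append((color, packet))
        return (color, packet)

    def receive(self, color, packet):
        # A packet colored for the *next* superstep proves that every
        # processor has logically passed the barrier, so all packets of
        # the previous color are implicitly acknowledged: no explicit
        # acknowledgment messages need to be sent for them.
        if color != self.superstep % 2:
            old_color = color ^ 1
            self.unacked = [(c, p) for (c, p) in self.unacked
                            if c != old_color]
```

Under this rule, an endpoint that is holding unacknowledged red packets can discard them all the moment any green packet arrives, from any processor.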
These techniques rely on the semantics of the BSP model, and on exploiting knowledge about the global structure of the communications. With BSP, a process that is about to communicate can know exactly which other processes are about to communicate, and how much they plan to send, and can therefore know the global applied load. Exploiting this knowledge has been shown to significantly improve the performance of communications on point-to-point networks. This BSP-based globally controlled communication allows one to achieve low latency and high bandwidth in those situations which arise frequently in practical applications, where traffic is dense and irregular.
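As an illustration of how the global applied load can be derived once the full communication pattern is known, the sketch below (the function name is hypothetical, not from the source) computes the standard BSP h-relation: the maximum number of words that any processor sends or receives in a superstep.

```python
def h_relation(messages, p):
    """Compute h, the global applied load of a superstep.

    messages: list of (src, dst, size) triples describing every planned
              transfer; p: number of processors.
    Returns the maximum over all processors of words sent or received.
    """
    sent = [0] * p
    recv = [0] * p
    for src, dst, size in messages:
        sent[src] += size
        recv[dst] += size
    return max(max(sent), max(recv))
```

For example, with messages `[(0, 1, 4), (1, 0, 2), (2, 0, 3)]` on 3 processors, processor 0 receives 5 words in total, so h = 5.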
Globally controlled communication for single program systems has been used to develop high performance native implementations of BSPlib [10], a BSP communications library, for a number of cluster [3,15], symmetric multiprocessor (SMP) [3] and massively parallel processor [3] architectures, in particular, for the following parallel architectures: Silicon Graphics Origin 2000, Silicon Graphics Power Challenge, Cray T3D, Cray T3E, IBM SP2, Convex Exemplar, Sun Multiprocessor, Digital 8400, Digital Alpha Farm, Hitachi SR2001, Fujitsu AP1000. Generic implementations of BSPlib for single program systems have also been produced for any scalable parallel architecture which provides at least one of the following: TCP/IP, UDP/IP, Unix System V Shared Memory, MPI, Cray SHMEM, PARMACS. Papers [4,5,6,7,11,17] describe the above techniques for globally controlled communication in single program systems.
FIG. 1 illustrates the methods of message combining and destination scheduling in a single BSP program. It shows how the messages in a single BSP program can be combined and globally scheduled to avoid unnecessary overheads and minimize destination contention. We briefly describe this operation. In superstep t (100):

Processor P1 (105) has four messages, identified by the target numbers corresponding respectively to processors P2, P3, P4, P2.

Processor P2 (110) has two messages, one for P1 and one for P3.

Processor P3 (115) has two messages, one for P2 and one for P4.

Processor P4 (120) has five messages, for processors P2, P1, P2, P2 and P3.

When all of the processors have completed their local computations, and initiated all of their communications, a barrier synchronization (125) is performed.

On each processor, messages destined for the same processor are combined (Step 130). The injection of these combined messages into the network is scheduled to avoid destination contention. As shown in the figure, this can be done by first realizing the permutation (P1→P2, P2→P3, P3→P4, P4→P1) (Step 135), then realizing (P1→P3, P2→P4, P3→P1, P4→P2) (Step 140), then finally realizing (P1→P4, P2→P1, P3→P2, P4→P3) (Step 145).

When a processor has received all of its incoming messages (150), it can begin the local computations of superstep t+1 (155).
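The combining and round-by-round scheduling illustrated in FIG. 1 can be sketched as follows. The schedule uses the classic offset rule (in round r, processor i sends to processor (i + r) mod p), which reproduces the three permutations of the figure; the function names are illustrative, not from the patent.

```python
from collections import defaultdict

def combine(outgoing):
    """Combine all of one processor's messages for the same destination
    into a single message (Step 130 in FIG. 1)."""
    combined = defaultdict(list)
    for dst, payload in outgoing:
        combined[dst].append(payload)
    return dict(combined)

def schedule(all_outgoing):
    """Contention-free injection schedule: in round r, processor i sends
    its combined message (if it has one) to processor (i + r) mod p, so
    no two processors ever target the same destination in one round."""
    p = len(all_outgoing)
    combined = [combine(msgs) for msgs in all_outgoing]
    rounds = []
    for r in range(1, p):
        rounds.append([(i, (i + r) % p, combined[i][(i + r) % p])
                       for i in range(p) if (i + r) % p in combined[i]])
    return rounds
```

With the traffic of FIG. 1 (processors numbered 0–3 for P1–P4), round 1 realizes P1→P2, P2→P3, P3→P4, P4→P1, exactly as in Step 135, and the two messages P1 holds for P2 travel as one combined message.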
Efficient global communication in the BSP style is discussed in U.S. Pat. No. 5,083,265, “Bulk-Synchronous Parallel Computer,” issued to Leslie G. Valiant and incorporated herein by reference in its entirety. A global mechanism for realizing complex patterns of communication for which it is efficacious for packets to a common destination to be combined at nodes other than the sender or receiver is discussed in U.S. Pat. No. 5,608,870, “System for combining a plurality of requests referencing a common target address into a single combined request having a single reference to the target address,” also issued to Leslie G. Valiant and incorporated herein by reference in its entirety. The techniques and mechanisms for realizing multiprogrammed BSP machines that are the substance of the present invention are not, however, anticipated in either of these earlier patents.
Work on the BSP model [17,19] has also shown that the resources available on any multiprocessor system can be accurately characterized in terms of a concise “machine signature” which describes the number of processors, the processor speed, the global communications capacity, and the global synchronization cost.
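A standard use of such a machine signature, conventionally written (p, s, g, l) in the BSP literature (a well-known BSP convention, not detailed in this text), is to predict the running time of a superstep:

```python
def superstep_cost(w, h, g, l):
    """Standard BSP cost of one superstep: w is the maximum local
    computation on any processor, h the size of the h-relation routed,
    g the per-word communication cost, and l the barrier
    synchronization cost (g and l normalized to the processor speed s
    of the machine signature)."""
    return w + h * g + l
```

Because g and l are properties of the machine alone, the same program's communication behavior can be predicted across different architectures from their signatures.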
Donaldson Stephen R.
Hill Jonathan M. D.
McColl William F.
Valiant Leslie G.
An Meng-Al T.
Riddle G. Mack
Smart John A.
Sychron Inc.
Vo Lilian