Dynamic load balancing among processors in a parallel computer

Electrical computers and digital processing systems: multicomput – Computer-to-computer data routing – Least weight routing

Reexamination Certificate

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

C709S241000, C709S241000

Reexamination Certificate

active

06292822

ABSTRACT:

FIELD OF THE INVENTION
The invention relates to parallel processing on distributed memory and shared memory parallel computers having multiple processors.
BACKGROUND OF THE INVENTION
It is hard to program parallel computers. Dealing with many processors at the same time, either explicitly or implicitly, makes parallel programs harder to design, analyze, build, and evaluate than their serial counterparts. However, using a fast serial computer to avoid the problems of parallelism is often not enough. There are always problems that are too big, too complex, or whose results are needed too soon.
Ideally, a parallel programming model or language should provide the same advantages we seek in serial languages: portability, efficiency, and ease of expression. However, it is typically impractical to extract parallelism from sequential languages. In addition, previous parallel languages have generally ignored the issue of nested parallelism, where the programmer exposes multiple simultaneous sources of parallelism in an algorithm. Supporting nested parallelism is particularly important for irregular algorithms, which operate on non-uniform data structures (for example, sparse arrays, trees and graphs).
Parallel Languages and Computing Models
The wide range of parallel architectures make it difficult to create a parallel computing model that is portable and efficient across a variety of architectures. Despite shifts in market share and the demise of some manufacturers, users can still choose between tightly-coupled shared-memory multiprocessors such as the SGI Power Challenge, more loosely coupled distributed-memory multicomputers such as the IBM SP2, massively-parallel SIMD machines such as the MasPar MP-2, vector supercomputers such as the Cray C90, and loosely coupled clusters of workstations such as the DEC SuperCluster. Network topologies are equally diverse, including 2D and 3D meshes on the Intel Paragon and ASCI Red machine, 3D tori on the Cray T3D and T3E, butterfly networks on the IBM SP2, fat trees on the Meiko CS-2, and hypercube networks on the SGI Origin2000. With extra design axes to specify, parallel computers show a much wider range of design choices than do serial machines, with each choosing a different set of tradeoffs in terms of cost, peak processor performance, memory bandwidth, interconnection technology and topology, and programming software.
This tremendous range of parallel architectures has spawned a similar variety of theoretical computational models. Most of these are variants of the original CRCW PRAM model (Concurrent-Read Concurrent-Write Parallel Random Access Machine), and are based on the observation that although the CRCW PRAM is probably the most popular theoretical model amongst parallel algorithm designers, it is also the least likely to ever be efficiently implemented on a real parallel machine. That is, it is easily and efficiently portable to no parallel machines, since it places more demands on the memory system in terms of access costs and capabilities than can be economically supplied by current hardware. The variants handicap the ideal PRAM to resemble a more realistic parallel machine, resulting in the locality-preserving H-PRAM, and various asynchronous, exclusive access, and queued PRAMs. However, none of these models have been widely accepted or implemented.
Parallel models which proceed from machine characteristics and then abstract away details—that is, “bottom-up” designs rather than “top-down”—have been considerably more successful, but tend to be specialized to a particular architectural style. For example, LogP is a low-level model for message-passing machines, while BSP defines a somewhat higher-level model in terms of alternating phases of asynchronous computation and synchronizing communication between processors. Both of these models try to accurately characterize the performance of any message-passing network using just a few parameters, in order to allow a programmer to reason about and predict the behavior of their programs.
However, the two most successful recent ways of expressing parallel programs have been those which are arguably not models at all, being defined purely in terms of a particular language or library, with no higher-level abstractions. Both High Performance Fortran and the Message Passing Interface have been created by committees and specified as standards with substantial input from industry, which has helped their widespread adoption. HPF is a full language that extends sequential Fortran with predefined parallel operations and parallel array layout directives. It is typically used for computationally intensive algorithms that can be expressed in terms of dense arrays. By contrast, MPI is defined only as a library to be used in conjunction with an existing sequential language. It provides a standard message-passing model, and is a superset of previous commercial products and research projects such as PVM and NX. Note that MPI is programmed in a control-parallel style, expressing parallelism through multiple paths of control, whereas HPF uses a data-parallel style, calling parallel operations from a single thread of control.
Nested and Irregular Parallelism
Neither HPF or MPI provide direct support for nested parallelism or irregular algorithms. For example, consider the quicksort algorithm set forth below. The irregularity comes from the fact that the two subproblems that quicksort creates are typically not of the same size; that is, the divide-and-conquer algorithm is unbalanced.
procedure QUICKSORT(S):
if S contains at most one element
then
return S
else
begin
choose an element a randomly from S;
let S
1
, S
2
and S
3
be the sequences of elements in S
less than, equal to, and greater than &agr;, respectively;
return (QUICKSORT(S
1
) followed by S
2
followed by QUICKSORT(S
3
))
end
Although it was originally written to describe a serial algorithm, the pseudocode shown above contains both data-parallel and control-parallel operations. Comparing the elements of the sequence S to the pivot element &agr;, and selecting the elements for the new subsequences S
1
, S
2
, and S
3
are inherently data-parallel operations. Meanwhile, recursing on S
1
and S
3
can be implemented as a control-parallel operation by performing two recursive calls in parallel on two different processors.
Note that a simple data-parallel quicksort (such as one written in HPF) cannot exploit the control parallelism that is available in this algorithm, while a simple control-parallel quicksort (such as one written in a sequential language and MPI) cannot exploit the data parallelism that is available. For example, a simple control-parallel divide-and-conquer implementation would initially put the entire problem onto a single processor, leaving the rest of the processors unused. At the first divide step, one of the subproblems would be passed to another processor. At the second divide step, a total of four processors would be involved, and so on. The parallelism achieved by this algorithm is proportional to the number of threads of control, which is greatest at the end of the algorithm. By contrast, a data-parallel divide-and-conquer quicksort would serialize the recursive applications of the function, executing one at a time over all of the processors. The parallelism achieved by this algorithm is proportional to the size of the subproblem being operated on at any instant, which is greatest at the beginning of the algorithm. Towards the end of the algorithm there will be fewer data elements in a particular function application than there are processors, and so some processors will remain idle.
By simultaneously exposing both nested sources of parallelism, a nested parallel implementation of quicksort can achieve parallelism proportional to the total data size throughout the algorithm, rather than only achieving full parallelism at either the beginning (in data parallelism) or the end (in control parallelism) of the algorithm. The benefits of a nested parallel implementation are illustrated in more detail in Hardwick, Jonathan, C., Pra

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

Dynamic load balancing among processors in a parallel computer does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with Dynamic load balancing among processors in a parallel computer, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Dynamic load balancing among processors in a parallel computer will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-2446124

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.