Method and apparatus for memory access in a matrix processor...

Classification: Electrical computers and digital processing systems: multicomput – Computer-to-computer data routing – Least weight routing (C712S013000)
Reexamination Certificate (active)
1991-06-04
2001-08-28
Examiner: Coulter, Kenneth R. (Department: 2154)
06282583
BACKGROUND OF THE INVENTION
The present invention relates to a computer system and particularly to a parallel processor system adapted to carrying out computationally intensive procedures.
A number of computer applications involve executing scientific algorithms on large arrays of data. Such algorithms are commonly referred to as matrix algorithms and have several significant characteristics: they typically operate on multi-dimensional data arrays, they can be naturally parallelized and broken down into blocks, and they involve many computations per data point. Since most general purpose computer systems are adapted to single scalar operations and do not perform array computations efficiently, special computer architectures have been developed that reduce the time necessary to process large arrays of data. However, array processing for complex algorithms is still comparatively time consuming.
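The block decomposition described above can be sketched in a few lines. This is a minimal illustration, not taken from the patent: a matrix-matrix multiply split into independent row blocks, each of which could in principle be assigned to a separate processing element. The function names are hypothetical.

```python
def matmul_block(a, b, row0, row1):
    """Compute rows [row0, row1) of the product a @ b using plain Python lists.
    Each block depends only on its own rows of a, so blocks are independent."""
    n = len(b[0])
    k = len(b)
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(row0, row1)]

def blocked_matmul(a, b, block_rows=2):
    """Stitch together independently computed row blocks.
    In a parallel system each iteration could run on a different processor."""
    out = []
    for r in range(0, len(a), block_rows):
        out.extend(matmul_block(a, b, r, min(r + block_rows, len(a))))
    return out
```

Note that each output element requires k multiply-adds, illustrating the "many computations per data point" characteristic.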
Traditional vector supercomputers perform matrix algorithms by decomposing them into a series of vector instructions: reading a vector from memory, performing an arithmetic operation such as a multiply-add, and storing the result back into memory. Operating in this manner often yields only two floating-point operations for every three memory accesses. The operation speed of such a series of vector operations can be increased by employing faster memory systems, but these can be prohibitively expensive. Memory bandwidth, i.e., the ability to move data between memory and processors, is typically the greatest expense in building a computer system.
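The two-flops-per-three-accesses ratio can be made concrete with an AXPY-style vector multiply-add, y[i] += a * x[i]. The accounting below is an illustrative sketch (not from the patent): each element performs one multiply and one add, but requires two loads and one store.

```python
def axpy(a, x, y):
    """Vector multiply-add y[i] += a * x[i], counting floating-point
    operations and memory accesses per element."""
    flops = mem_accesses = 0
    for i in range(len(x)):
        xi = x[i]            # memory access 1: load x[i]
        yi = y[i]            # memory access 2: load y[i]
        y[i] = yi + a * xi   # 2 flops (multiply, add); access 3: store y[i]
        flops += 2
        mem_accesses += 3
    return flops, mem_accesses
```

The resulting flop-to-memory-access ratio of 2:3 shows why memory bandwidth, rather than arithmetic capability, tends to limit such a machine.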
There are many other factors that limit the speed at which a computer system can process an array of data. However, eliminating speed roadblocks in certain areas does not always yield the most cost-effective means of increasing processing power. Sophisticated board technologies can minimize capacitance, propagation delays and noise, but the goal in developing computer systems is often to achieve high performance at low cost while providing the simplest means for implementing application software.
Another alternative for increasing overall computer processing capacity is increasing the number of processing elements used in the system. Employing twice the number of processors has the potential to reduce processing time by a factor of two relative to a single-processor system. However, depending on the application, the time required to perform an algorithm may not be reduced in proportion to an increase in the number of processing elements. Since many computer applications require completion of a first operation before a second operation can begin, a second processor that would normally perform the second operation often must remain idle until the first operation is finished. Such requirements can result in the dual processor computer system requiring approximately the same time to complete a procedure as would be required by a single processor system.
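The diminishing return described above is commonly formalized as Amdahl's law (the patent does not name it): if a fraction s of the work is inherently serial, n processors yield a speedup of 1 / (s + (1 - s) / n). A small sketch:

```python
def amdahl_speedup(serial_fraction, n_processors):
    """Speedup of a workload in which serial_fraction of the work
    cannot be parallelized; the rest divides evenly over n processors."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_processors)
```

With half the work serial, two processors give only a 1.33x speedup, and no number of processors can exceed 2x, matching the observation that a dual-processor system may take nearly as long as a single-processor one.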
Computer architectures can also limit processing capacity in that bottlenecks during data transfer can cause processing elements to idle. For example, in a single bus computer system, only one processor may access system memory at a time, while other processors connected to the bus idle until the processor presently using the bus completes its transfer. One method of reducing system bottlenecks is to provide computer architectures that perform a few specific algorithms quickly, but their application is limited. Computer bottlenecks may also be reduced by adding buses and allowing multiple accesses to memory at the same time. These methods add cost and complexity to the system and are still restricted by the type of algorithms performed.
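The single-bus bottleneck can be captured with a simplified throughput model (an assumption for illustration, not from the patent): if every processor needs the shared bus for some fraction of its time, aggregate throughput is capped by the bus serving one processor at a time.

```python
def effective_throughput(n_processors, bus_fraction):
    """Aggregate work rate, in single-processor equivalents, when each
    processor needs the one shared bus for bus_fraction of its time.
    The bus serves one processor at a time, so throughput saturates
    at 1 / bus_fraction regardless of how many processors are added."""
    return min(n_processors, 1.0 / bus_fraction)
```

For example, if each processor spends half its time on bus transfers, adding processors beyond two yields no further speedup; the extra processors simply idle waiting for the bus.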
Bus conflicts may be reduced with software employing machine code that will coordinate memory accesses for plural processors in a system. However, this method can become too complex to perform effectively as the number of processors increases and, in addition, the software must be rewritten for each system configuration change. Using semaphores for system coordination is counterproductive because accesses to system memory for reading and setting semaphores consume precious memory bandwidth.
A substantial problem encountered with computers employing multiple processors involves tracking the processors and the points at which a data bus is available for subsequent data transmission to global memory. Software compilers for sorting the data and processing commands to each processor have been employed in an attempt to maximize processing capacity by reducing data bus conflicts. However, as the number of processors increases, the multiprocessor compilers become less efficient in allocating bus time between processing elements. Since the bus use for each processor depends on the algorithm that is presently being performed, software apportionment is complex and it is difficult to attain maximum system processing capacity.
FIG. 1A illustrates a multiprocessor bi-directional bus architecture system with a common global memory as found in the prior art. Interface processor 10, multiple data processing elements 12, and global memory 14 are all connected to bus 16. Instruction data will typically enter the system on bus 16 from mass storage through I/O processor 10. Processor 10 transmits incoming code into the global memory 14. Data for each processor is also transmitted over bus 16 to global memory 14. Each processor may perform a portion of the entire algorithm, or depending on the algorithm and the amount of data, may perform the same algorithm on a different section of the data.
The system of FIG. 1A illustrates typical architecture for a multiprocessor work station. Such a work station employs inexpensive bus architecture, thereby providing an economical system. However, the single bus between the central processing units and memory impedes serious supercomputing.
Referring now to FIG. 1B, comprising a block diagram of a vector supercomputer in accordance with the prior art, a plurality of processors 12 are each coupled to a separate I/O interface 10 and also connect to crossbar 18 via multiple crossbar/processor ports 20, each processor coupling to one or more crossbar/processor ports 20. The crossbar connects the processors to memory 14 through multiple crossbar/memory ports 22. Crossbar 18 employs a complex multi-layered crossbar scheme according to the prior art for connecting the multiple processors 12 to memory bank 14. This complex crossbar and the memory interconnections required for such a system configuration, while effective in enhancing system performance, can be prohibitively expensive.
FIG. 1C is a block diagram of a relatively simpler computer system illustrating a prior art architecture for increasing bandwidth by allowing concurrent access to multiple ports in the same global memory array. Crossbar 18 connects multiple data processing elements 12 to global memory 14 by means of crossbar/processor ports 20 and crossbar/memory ports 22, each processor having its own dedicated crossbar/processor port. Memory 14 is provided with multiple input ports, each memory port being coupled to a single crossbar/memory port. Crossbar 18 decodes address values from each processor, connecting the data bus of the processing element that asserted a value on an address bus to the associated memory port in memory 14. The data on the processor data bus is transferred to or from the memory location through the memory port associated with the address value. An I/O interface 10 is also provided with a crossbar port. While such a system provides increased processing speed, the cost associated with supporting memory transfer bandwidth for each processor on the multiple memory ports is high relative to the gained computing speed. The systems of FIG. 1B and FIG. 1C illustrate typical structures wherein one or more crossbar ports are dedicated to each processor.
SUMMARY OF THE INVENTION
The present invention relates to a computer system including multiple processing elements connected to a
Inventors: Carlile, Bradley R.; Charlesworth, Alan E.; Pincus, Philip A.
Examiner: Coulter, Kenneth R.
Law firm: Schwegman Lundberg Woessner & Kluth P.A.
Assignee: Silicon Graphics Inc.