Electrical computers and digital processing systems: processing – Processing architecture – Array processor
Reexamination Certificate
2000-08-25
2004-06-22
Pan, Daniel H. (Department: 2183)
C712S013000, C712S015000, C712S018000, C712S019000, C711S173000, C711S003000
active
06754802
ABSTRACT:
FIELD OF THE INVENTION
The present invention relates to the field of massively parallel processing systems, and more particularly to the interconnection among processing elements and between processing elements and memory in a single chip massively parallel processor chip.
BACKGROUND OF THE INVENTION
The fundamental architecture used by all personal computers (PCs) and workstations is generally known as the von Neumann architecture, illustrated in block diagram form in FIG. 1. In the von Neumann architecture, a main central processing unit (CPU) 10 is coupled via a system bus 11 to a memory 12. The memory 12, referred to herein as “main memory”, also contains the data on which the CPU 10 operates. In modern computer systems, a hierarchy of cache memories is usually built into the system to reduce the amount of traffic between the CPU 10 and the main memory 12.
The von Neumann approach is adequate for low to medium performance applications, particularly when some system functions can be accelerated by special purpose hardware (e.g., 3D graphics accelerator, digital signal processor (DSP), video encoder or decoder, audio or music processor, etc.). However, the approach of adding accelerator hardware is limited by the bandwidth of the link from the CPU/memory part of the system to the accelerator. The approach may be further limited if the bandwidth is shared by more than one accelerator. Thus, the processing demands of large data sets, such as those commonly associated with large images, are not served well by the von Neumann architecture. Similarly, as the processing becomes more complex and the data larger, the processing demands will not be met even with the conventional accelerator approach.
It should be noted, however, that the von Neumann architecture has some advantages. For example, the architecture contains a homogenous memory structure allowing large memories to be built from many smaller standard units. In addition, because the processing is centralized, it does not matter where the data (or program) resides in the memory. Finally, the linear execution model is easy to control and exploit. Today's operating systems control the allocation of system memory and other resources using these properties. The problem is how to improve processing performance in a conventional operating system environment where multiple applications share and partition the system resources, and in particular, the main memory.
One solution is to utilize active memory devices, as illustrated in FIG. 2, in the computer system. Put simply, active memory is memory that can do more than store data; it can process it too. To the CPU 10, the active memory 15 looks normal except that it can be told to do something with the data contents without the data being transferred to the CPU or another part of the system (via the system bus 11). This is achieved by distributing an array 14 of processing elements (PEs) 200 throughout the memory structure, which can all operate on their own local pieces of memory in parallel. The array 14 of PEs 200 is coupled to the memory 12 via a high-speed connection network 13. In addition, the PEs 200 of the array 14 can communicate with each other. Thus, active memory encourages a somewhat different view of the computer architecture, i.e., “memory centered”, or viewed from the data rather than the processor.
In a computer system having active memory, such as illustrated in FIG. 2, the work of the CPU 10 is reduced to operating system tasks, such as scheduling processes and allocating system resources and time. Most of the data processing is performed within the active memory 15. By having a very large number of connections between the main memory 12 and the processing resources, i.e., the array 14 of PEs 200, the bandwidth for moving data in and out of the memory 12 is greatly increased. A large number of parallel processors can be connected to the memory 12 and can operate on their own area of memory independently. Together these two features can provide very high performance.
There are several different topologies for parallel processors. One example topology is commonly referred to as SIMD (single instruction, multiple data). The SIMD topology contains many processors, all executing the same stream of instructions simultaneously, but on their own (locally stored) data. The active memory approach is typified by SIMD massively parallel processor (MPP) architectures. In the SIMD MPP, a very large number (for example, one thousand) of relatively simple PEs 200 are closely connected to a memory and organized so that each PE 200 has access to its own piece of memory. All of the PEs 200 execute the same instruction together, but on different data.
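The SIMD execution model described above can be sketched as a small software simulation. This is only an illustrative model, not the hardware of the invention: the `ProcessingElement` class, the `broadcast` function, and the two-instruction program are hypothetical names and values, and the loop over PEs stands in for what the hardware would perform in parallel.

```python
from typing import Callable, List


class ProcessingElement:
    """One simple PE with its own private piece of memory (hypothetical model)."""

    def __init__(self, local_memory: List[int]) -> None:
        self.local_memory = local_memory

    def execute(self, instruction: Callable[[int], int]) -> None:
        # Apply the broadcast instruction to every word this PE owns.
        self.local_memory = [instruction(word) for word in self.local_memory]


def broadcast(pes: List[ProcessingElement], instruction: Callable[[int], int]) -> None:
    """Issue one instruction to all PEs; each runs it on its own local data."""
    for pe in pes:  # sequential here; SIMD hardware would do this in lockstep
        pe.execute(instruction)


# Split a small data set so that each PE holds its own piece of it.
data = list(range(8))
pes = [ProcessingElement(data[i:i + 2]) for i in range(0, len(data), 2)]

# A single instruction stream shared by every PE: double, then add one.
program = [lambda x: x * 2, lambda x: x + 1]
for instruction in program:
    broadcast(pes, instruction)

# Gather the per-PE results back into one list for inspection.
result = [word for pe in pes for word in pe.local_memory]
```

Every PE applies the same two instructions in the same order, yet each produces different results because it operates only on the words held in its own local memory.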
The SIMD MPP has the advantage that the control overheads of the system are kept to a minimum, while maximizing the processing and memory access bandwidths. SIMD MPPs, therefore, have the potential to provide very high performance very efficiently. Moreover, the hardware consists of many fairly simple repeating elements. Since the PEs 200 are quite small in comparison to a reduced instruction set computer (RISC), they are easy to implement into a system design and their benefit with respect to optimization is multiplied by the number of processing elements. In addition, because the PEs 200 are simple, it is possible to clock them fast and without resorting to deep pipelines.
In a massively parallel processor array, the design of the interconnections among the processing elements and the interconnections between the PEs 200 and the memory 12 is an important feature. Traditional massively parallel processors utilize a plurality of semiconductor chips for the processor element array 14 and the memory 12. The chips are connected via a simple network of wires. However, as shown in FIG. 3, advances in semiconductor technology now permit a SIMD massively parallel processor with a memory to be integrated onto a single active memory chip 100. Since signals which are routed within a semiconductor chip can travel significantly faster than inter-chip signals, the single chip active memory 100 has the potential of operating significantly faster than a prior art SIMD MPP. However, achieving high speed operation requires more than merely integrating the elements of a traditional prior art SIMD MPP into one active memory chip 100. For example, careful consideration must be given to the way the PEs 200 of the PE array 14 are wired together, since this affects the length of the interconnections between the PEs 200 (thereby affecting device speed), the mapping of the memory as seen by the PEs 200, the power consumed to drive the interconnection network, and the cost of the active memory chip 100. Accordingly, there is a desire and need for an affordable high speed SIMD MPP active memory chip with an optimized interconnection arrangement between the PEs.
SUMMARY OF THE INVENTION
In one aspect, the present invention is directed to a single chip active memory with a SIMD MPP. The active memory chip contains a full word interface, a memory in the form of a plurality of memory stripes, and a PE array in the form of a plurality of PE sub-arrays. The memory stripes are arranged between and coupled to both the plurality of PE sub-arrays and the full word interface. Each PE sub-array is coupled to the full word interface and a corresponding memory stripe. In order to route the numerous couplings between a memory stripe and its corresponding PE sub-array, the PE sub-array is placed so that its data path is orthogonal to the orientation of the memory stripes. The data lines of the PE sub-arrays are formed on one metal layer and coupled to the memory stripe data lines which are formed on a different metal layer having an orthogonal orientation.
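The correspondence between PE sub-arrays and memory stripes described above can be sketched as a simple index calculation. The summary states only that each PE sub-array is coupled to a corresponding memory stripe; the linear PE numbering, the sub-array sizes, and the names `PELocation` and `locate_pe` below are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PELocation:
    sub_array: int            # which PE sub-array the PE belongs to
    stripe: int               # memory stripe coupled to that sub-array
    index_in_sub_array: int   # position of the PE within its sub-array


def locate_pe(pe_id: int, pes_per_sub_array: int, num_sub_arrays: int) -> PELocation:
    """Map a linear PE number to its sub-array and corresponding memory stripe.

    Assumes PEs are numbered consecutively, sub-array by sub-array, and that
    sub-array k is coupled to memory stripe k (one stripe per sub-array).
    """
    total_pes = pes_per_sub_array * num_sub_arrays
    if not 0 <= pe_id < total_pes:
        raise ValueError(f"pe_id {pe_id} is outside 0..{total_pes - 1}")
    sub_array, index = divmod(pe_id, pes_per_sub_array)
    return PELocation(sub_array=sub_array, stripe=sub_array, index_in_sub_array=index)


# Hypothetical layout: 4 sub-arrays of 16 PEs each.
location = locate_pe(37, pes_per_sub_array=16, num_sub_arrays=4)
```

With the hypothetical layout shown, PE number 37 falls in sub-array 2 and therefore uses memory stripe 2.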
In another aspect of the present invention, the PEs each contain a small register file constructed as a small DRAM array. Small DRAM arrays are sufficiently fast to serve as a register file and utilize less power and semiconduc
Dickstein, Shapiro, Morin & Oshinsky, LLP