System and method for performing compound vector operations

Electrical computers: arithmetic processing and calculating – Electrical digital calculating computer – Particular function performed

Reexamination Certificate

Details

C712S001000, C712S010000, C712S016000, C712S020000, C712S021000, C708S003000

active

06192384

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention is directed to computer architectures. More specifically, the invention is directed to pipelined and parallel processing computer systems which are designed to efficiently handle continuous streams of instructions and data.
2. Description of Related Art
Providing adequate instruction and data bandwidth is a key problem in modern computer systems. In a conventional scalar architecture, each arithmetic operation, e.g., an addition or multiplication, requires one word of instruction bandwidth to control the operation and three words of data bandwidth to provide the input data and to consume the result (two words for the operands and one word for the result). Thus, the raw bandwidth demand is four words per operation. Conventional architectures use a storage hierarchy consisting of register files and cache memories to provide much of this bandwidth; however, since arithmetic bandwidth scales with advances in technology, providing this instruction and data bandwidth at each level of the memory hierarchy, particularly the bottom, is a challenging problem.
Vector architectures have emerged as one approach to reducing the instruction bandwidth required for a computation. With conventional vector architectures, e.g., the Cray-1, a single instruction word specifies a sequence of arithmetic operations, one on each element of a vector of inputs. For example, a vector addition instruction VADD VA, VB, VC causes each element of a vector VA (of, e.g., sixty-four elements) to be added to the corresponding element of a vector VB, with the result being placed in the corresponding element of a vector VC. Thus, to the extent that the computation being performed can be expressed in terms of vector operations, a vector architecture reduces the required instruction bandwidth by a factor of the vector length (sixty-four in the case of the Cray-1).
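As a rough illustration of this instruction-bandwidth saving, the following Python sketch models a scalar loop and a single vector instruction over the same sixty-four element inputs. The function names and the instruction-word counters are hypothetical bookkeeping introduced only for this example; they are not part of any real instruction set.

```python
VECTOR_LENGTH = 64  # vector length of the Cray-1, as cited in the text


def scalar_add(va, vb):
    """Add element by element, issuing one instruction word per addition."""
    vc = []
    instruction_words = 0
    for a, b in zip(va, vb):
        vc.append(a + b)
        instruction_words += 1  # each scalar add needs its own instruction word
    return vc, instruction_words


def vadd(va, vb):
    """Model VADD VA, VB, VC: one instruction word covers the whole vector."""
    vc = [a + b for a, b in zip(va, vb)]
    instruction_words = 1
    return vc, instruction_words


va = list(range(VECTOR_LENGTH))
vb = [2 * x for x in va]

scalar_result, scalar_words = scalar_add(va, vb)
vector_result, vector_words = vadd(va, vb)

# Both forms compute the same values; only the instruction traffic differs.
reduction = scalar_words // vector_words
print(f"instruction words: scalar={scalar_words}, vector={vector_words}, "
      f"reduction factor={reduction}")
```

Under these assumptions the reduction factor equals the vector length, while the data words moved per addition are the same in both forms.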
While vector architectures may alleviate some of the instruction bandwidth requirements, data bandwidth demands remain undiminished. Each arithmetic operation still requires three words of data bandwidth from a global storage resource shared by all arithmetic units. In most vector architectures, this global storage resource is the vector register file. As the number of arithmetic units is increased, this register file becomes a bottleneck that limits further improvements in machine performance.
To reduce the latency of arithmetic operations, some vector architectures perform “chaining” of arithmetic operations. For example, consider performing the above vector addition operation and then performing the vector multiplication operation VMUL VC, VD, VE using the result. With chaining, the vector multiply instruction consumes the elements computed by the vector add instruction in VC as they are produced, without waiting for the entire vector add instruction to complete. Chaining, however, does not diminish the demand for data bandwidth either: each arithmetic operation still requires three words of bandwidth from the vector register file.
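Chaining can be pictured with Python generators, where the multiply stage pulls each sum as soon as the add stage produces it. This is only a software analogy under stated assumptions: the function names, the event log, and the three-element vectors are hypothetical, and the register-file word count simply applies the three-words-per-operation figure from the text.

```python
def vadd_stream(va, vb, log):
    """Yield each sum as soon as it is computed (models VADD VA, VB, VC)."""
    for i, (a, b) in enumerate(zip(va, vb)):
        log.append(("add", i))
        yield a + b


def vmul_stream(vc, vd, log):
    """Consume VC elements as they arrive (models VMUL VC, VD, VE)."""
    for i, (c, d) in enumerate(zip(vc, vd)):
        log.append(("mul", i))
        yield c * d


va, vb, vd = [1, 2, 3], [10, 20, 30], [2, 2, 2]
log = []

# The multiply starts on element 0 before the add has touched element 1.
ve = list(vmul_stream(vadd_stream(va, vb, log), vd, log))

# Chaining shortens latency but not traffic: each of the 2 * len(va)
# arithmetic operations still moves three words through the register file.
register_file_words = 3 * 2 * len(va)

print(ve)
print(log)
print(register_file_words)
```

The interleaved log shows the latency benefit, while the word count is unchanged from the unchained case.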
BRIEF SUMMARY OF THE INVENTION
In view of the above problems of the prior art, it is an object of the present invention to provide a data processing system and method which can provide a high level of performance without a correspondingly high memory bandwidth requirement.
It is another object of the present invention to provide a data processing system and method which can reduce global storage resource bandwidth requirements relative to a conventional scalar or vector processor.
It is a further object of the present invention to provide a parallel processing system and method which minimizes the number of external access operations each processor conducts.
It is yet another object of the present invention to provide a parallel processing system and method which utilizes granular levels of operation of a higher order than individual arithmetic operations.
It is still another object of the present invention to provide a parallel processing system and method which is capable of simultaneously exploiting multiple levels of parallelism within a computing process.
It is yet a further object of the present invention to provide a single-chip processing system which reduces the number of off-chip memory accesses.
The above objects are achieved according to a first aspect of the present invention by providing a processor having a tiered storage architecture to minimize global bandwidth requirements. The processor has a stream register file through which the processor's arithmetic units transfer streams to execute processor operations. Load and store instructions transfer streams between the stream register file and a stream memory; send and receive instructions transfer streams between stream register files of different processors; and operate instructions pass streams between the stream register file and computational kernels.
Each of the computational kernels is capable of performing compound vector operations. A compound vector operation performs a sequence of arithmetic operations on data read from the stream register file, i.e., a global storage resource, and generates a result that is written back to the stream register file. Each function or compound vector operation is defined by an instruction sequence that specifies the arithmetic operations and data movements performed each cycle to carry out the compound operation. This sequence can, for example, be specified using microcode.
Because intermediate results are forwarded directly between arithmetic units and not loaded from or stored to the stream register file, bandwidth demands on the stream register file are greatly reduced and global storage bandwidth requirements are minimized.
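The effect of forwarding intermediate results can be sketched in Python by counting the words that cross a global storage boundary. The `StreamRegisterFile` class, its `read`/`write` methods, the stream names, and the add-then-multiply kernel are all hypothetical illustrations chosen for this example; the patent text does not define such an interface.

```python
class StreamRegisterFile:
    """Hypothetical global storage that counts every word read or written."""

    def __init__(self, streams):
        self.streams = dict(streams)  # initial contents are not counted
        self.words_moved = 0

    def read(self, name):
        data = self.streams[name]
        self.words_moved += len(data)
        return list(data)

    def write(self, name, data):
        self.streams[name] = list(data)
        self.words_moved += len(data)


def separate_vector_ops(srf):
    """Two vector instructions: intermediate C round-trips through the file."""
    a, b = srf.read("A"), srf.read("B")
    srf.write("C", [x + y for x, y in zip(a, b)])
    c, d = srf.read("C"), srf.read("D")
    srf.write("E", [x * y for x, y in zip(c, d)])


def compound_vector_op(srf):
    """One kernel: the sum stays local and is never stored globally."""
    a, b, d = srf.read("A"), srf.read("B"), srf.read("D")
    result = []
    for x, y, z in zip(a, b, d):
        intermediate = x + y  # forwarded directly to the multiplier
        result.append(intermediate * z)
    srf.write("E", result)


inputs = {"A": [1, 2, 3, 4], "B": [5, 6, 7, 8], "D": [2, 2, 2, 2]}
baseline = StreamRegisterFile(inputs)
compound = StreamRegisterFile(inputs)

separate_vector_ops(baseline)
compound_vector_op(compound)

print(baseline.words_moved, compound.words_moved)
```

In this toy case the saving is modest because only one intermediate value is kept local; the saving grows with the number of operations fused into the kernel.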
For example, consider the problem of performing a transformation on a sequence of points, a key operation in many graphics systems when, e.g., adjusting for perspective or moving from a model space to a world space. In its most basic form, the operation requires reading three words of data for each point (x, y, z), performing a 4×4 vector-matrix multiply, taking the reciprocal of a number, performing three multiplies, and writing the resulting point (x′, y′, z′) in the new coordinate system. Without optimizations, the perspective transformation requires thirty-two arithmetic operations for each point—nineteen multiplications, twelve additions and one reciprocal operation. On conventional vector architectures, this would require ninety-six words of vector register bandwidth per point.
In contrast, a compound vector architecture as described in greater detail below can perform the perspective transformation in a single operation. The compound vector operation requires only six words of global storage bandwidth per point: three words to read the coordinates of the original point (x, y, z) and three words to write the coordinates of the transformed point (x′, y′, z′). All of the intermediate results are forwarded directly between arithmetic units and thus do not require global storage bandwidth. This sixteen-fold reduction in vector register bandwidth greatly improves the scalability of the architecture. In effect, the compound vector architecture moves the vector register file access outside of a function such as perspective transformation.
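The operation counts and the sixteen-fold figure can be checked with a short Python sketch that performs the unoptimized transformation while tallying each arithmetic operation. The `OpCounter` class, the row-vector convention, and the sample matrix (which simply sets the homogeneous coordinate to z) are assumptions made for illustration, not details taken from the patent.

```python
class OpCounter:
    """Hypothetical tally of arithmetic operations performed per point."""

    def __init__(self):
        self.mul = 0
        self.add = 0
        self.recip = 0

    @property
    def total(self):
        return self.mul + self.add + self.recip


def transform_point(point, matrix, ops):
    """Unoptimized perspective transform of (x, y, z) by a 4x4 matrix."""
    x, y, z = point
    v = (x, y, z, 1.0)  # homogeneous row vector
    out = []
    for col in range(4):
        acc = v[0] * matrix[0][col]
        ops.mul += 1
        for row in range(1, 4):
            acc = acc + v[row] * matrix[row][col]
            ops.mul += 1
            ops.add += 1
        out.append(acc)
    inv_w = 1.0 / out[3]  # the single reciprocal
    ops.recip += 1
    ops.mul += 3          # three multiplies by the reciprocal
    return (out[0] * inv_w, out[1] * inv_w, out[2] * inv_w)


# Sample matrix (an assumption): identity, except that w' is set to z.
matrix = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 0.0],
]

ops = OpCounter()
transformed = transform_point((2.0, 4.0, 2.0), matrix, ops)

conventional_words = 3 * ops.total  # three register-file words per operation
compound_words = 3 + 3              # read (x, y, z), write (x', y', z')
reduction = conventional_words // compound_words

print(ops.mul, ops.add, ops.recip, conventional_words, compound_words, reduction)
```

The tally reproduces the nineteen multiplications, twelve additions, and one reciprocal stated above, and the ratio of ninety-six to six words gives the sixteen-fold reduction.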


REFERENCES:
patent: 4807183 (1989-02-01), Kung et al.
patent: 5327548 (1994-07-01), Hardell, Jr. et al.
patent: 5522083 (1996-05-01), Gove et al.
patent: 5692139 (1997-11-01), Slavenburg et al.
Rixner et al., “A bandwidth-efficient architecture for media processor.” Proceedings on Annual ACM/IEEE International Symposium on Microarchitecture, p. 3-13, Nov., 1998.
Borkar, et al. “iWarp: an integrated solution to high-speed parallel computing.” Proceedings on Supercomputing, p. 330-339, Nov., 1988.
