Electrical computers and digital processing systems: processing – Processing control – Arithmetic operation instruction processing
Reexamination Certificate
1998-12-30
2001-11-20
Chan, Eddie (Department: 2783)
C712S022000
active
06321327
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Technical Field
The present invention relates to systems for processing data and, in particular, to systems for processing data through single-instruction multiple data (SIMD) operations.
2. Background Art
Processor designers are always looking for ways to enhance the performance of microprocessors. Processing multiple operands in parallel provides one avenue for gaining additional performance from today's highly optimized processors. In certain common mathematical calculations and graphics operations, the same operation(s) is performed repeatedly on each of a large number of operands. For example, in matrix multiplication, the row elements of a first matrix are multiplied by corresponding column elements of a second matrix and the resulting products are summed (multiply-accumulate). By providing appropriate scheduling and execution resources, multiply-accumulate operations may be implemented concurrently on multiple sets of row-column operands. This approach is known as vector processing or single instruction, multiple data stream (SIMD) processing to distinguish it from scalar or single instruction, single data stream (SISD) processing.
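The multiply-accumulate pattern described above can be sketched in scalar form as follows. This is only an illustration: the function name `matmul` and the list-of-lists matrix layout are assumptions made for the example and do not come from the patent. The innermost loop is the multiply-accumulate that a SIMD processor could carry out on several row-column pairs at once.

```python
def matmul(a, b):
    """Multiply matrix a (rows x inner) by matrix b (inner x cols).

    Hypothetical helper for illustration; matrices are lists of rows.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    result = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0.0  # running sum for one row-column pair
            for k in range(inner):
                # multiply-accumulate: row element times column element
                acc += a[i][k] * b[k][j]
            result[i][j] = acc
    return result
```

Each (i, j) pair is independent of the others, which is what makes it possible to run several of these accumulations concurrently.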
In order to implement SIMD operations efficiently, data is typically provided to the execution resources in a “packed” data format. For example, a 64-bit processor may operate on a packed data block, which includes two 32-bit operands. In this example, a vector multiply-accumulate instruction, V-FMA (f1, f2, f3), multiplies each of a pair of 32-bit operands stored in register f1 by the corresponding one of a pair of 32-bit operands stored in register f2 and adds the resulting products to a pair of running sums stored in register f3. In other words, data is stored in the registers f1, f2, and f3 in a packed format that provides two operands from each register entry. If the processor has sufficient resources, it may process two or more packed data blocks, e.g. four or more 32-bit operands, concurrently. The 32-bit operands are routed to different execution units for processing in parallel and subsequently repacked, if necessary.
Even in graphics-intensive and scientific programming, not all operations are SIMD operations. Much of the software executed by general-purpose processors comprises instructions that perform scalar operations. That is, each source register specified by an instruction stores one operand, and each target register specified by the instruction receives one operand. In the above example, a scalar floating-point multiply-accumulate instruction, S-FMA (f1, f2, f3), may multiply a single 64-bit operand stored in register f1 by a corresponding 64-bit operand stored in register f2 and add the product to a running sum stored in register f3. Each operand processed by the S-FMA instruction is provided to the floating-point multiply-accumulate (FMAC) unit in an unpacked format.
The register files that provide source operands to and receive results from the execution units consume significant amounts of a processor's die area. Available die area is a scarce resource on most processor chips. For this reason, processors typically include one register file for each major data type. For example, a processor typically has one floating-point register file that stores both packed and unpacked floating-point operands. Consequently, packed and unpacked operands are designed to fit in the same sized register entries, despite the fact that a packed operand includes two or more component operands.
Providing execution resources for packed and unpacked operands creates performance/cost challenges. One way to provide high performance scalar and vector processing is to include separate scalar and vector execution units. An advantage of this approach is that the vector and scalar execution units can each be optimized to process data in its corresponding format, i.e. packed and unpacked, respectively. The problem with this approach is that the additional execution units consume silicon die area, which is a relatively precious commodity.
In addition to providing appropriate execution resources, high performance processors must include mechanisms for transferring both packed and unpacked operand data efficiently. These mechanisms include those that transfer operand data to the register file from the processor's memory hierarchy, e.g. caches, and those that transfer operand data from the register file to the execution resources.
The present invention addresses these and other problems with currently available SIMD systems.
SUMMARY OF THE INVENTION
A system is provided that supports efficient processing of a floating-point operand by setting an implicit bit for the operand “on-the-fly”, i.e. as the operand is loaded into a register file entry.
In accordance with the present invention, a floating-point operand is retrieved for loading into a register file entry. Selected bits of the floating-point operand are tested, and an implicit bit associated with the register file entry is set when the selected bits are in a first state.
For one embodiment of the invention, the floating-point operand is a packed operand that includes two or more component operands, and the register file entry includes an implicit bit for each component operand. The implicit bit for a component operand is set when the selected bits indicate that the component operand is normalized.
The present invention thus allows the normal/denormal status of an operand to be determined when the operand is loaded into the register file and tracked through an implicit bit associated with the corresponding register file entry. This eliminates the need for status-determining logic in the operand delivery module, which transfers the operand from the register file to the execution unit. Since the operand delivery module is on a critical (bypass) path for the execution unit, processor performance may be significantly improved.
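The load-time test described in this summary might look like the sketch below for a packed operand with two single-precision components. Several details are assumptions rather than statements from the patent text above: the "selected bits" are taken to be the 8-bit exponent field of the IEEE 754 single-precision format, a nonzero exponent is taken to mean the component is normalized, and the names `implicit_bit` and `load_packed` are hypothetical.

```python
EXP_MASK = 0x7F800000  # exponent field of an IEEE 754 single (assumed test bits)
MASK32 = 0xFFFFFFFF

def implicit_bit(component_bits):
    """Return 1 if the exponent field is nonzero, else 0.

    Assumption: a zero exponent marks a denormal or zero component,
    for which the implicit bit stays clear.
    """
    return 1 if (component_bits & EXP_MASK) != 0 else 0

def load_packed(operand):
    """Model loading a 64-bit packed operand into a register file entry.

    Returns (operand, implicit bit for upper component,
    implicit bit for lower component).
    """
    hi = (operand >> 32) & MASK32
    lo = operand & MASK32
    return operand, implicit_bit(hi), implicit_bit(lo)
```

Because the implicit bits are computed once at load time and stored with the entry, a later read of the entry needs no further normal/denormal test, which is the benefit the summary attributes to keeping this logic off the operand delivery path.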
Doshi Gautam B.
Golliver Roger A.
Kimn Sunnhyuk
Makineni Sivakumar
Chan Eddie
Intel Corporation
Novakoski Leo V.
Patel Gautam R.