Electrical computers: arithmetic processing and calculating – Electrical digital calculating computer – Particular function performed
Reexamination Certificate
1999-10-08
2002-03-05
Malzahn, David H. (Department: 2121)
active
06353843
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to the field of hardware used for implementing arithmetic operations such as processor instructions. More specifically, the present invention relates to a multiplier circuit capable of performing operations on operands of various data types, for both signed and unsigned binary values.
2. Related Art
Hardware multipliers are an indispensable component of every computer system, cellular phone and most digital audio/video equipment. In real-time applications (e.g., flight simulators, speech recognition, video teleconferencing, computer games, streaming audio/video etc.), the overall system performance is heavily dependent on the speed of the internal multipliers. For instance, processing digital images at 30 frames/second requires nearly 2.2 million multiply operations per second. Therefore, designing fast multipliers that occupy smaller areas on the integrated circuit (IC) chip and that consume less power is essential to a successful product.
In multimedia applications, multipliers are used to perform a wide range of functions such as the Inverse Discrete Cosine Transform (IDCT), Fast Fourier Transform (FFT), and Multiply-Accumulate (MAC) on 8-bit, 16-bit, and 32-bit signed and unsigned operands. It would be advantageous to provide a multiplier device which can support a variety of data formats. One effort to produce a multiplier that can support a variety of data formats resulted in multi-cycle multipliers.
FIG. 1 illustrates the operation 10 of a multi-cycle multiplier of the prior art. In the multi-cycle multiplier, a smaller multiplier circuit (e.g., 8×8 bit) is used to compute partial products (e.g., step 12) which are accumulated together (e.g., step 14) to form the final result. The multi-cycle or “iterative” method uses a basic multiplier to perform the multiplication for larger word lengths. This method does not allow high throughput for large word lengths, and although it may result in a shorter delay for 8-bit operations, the extra cycles to perform 16-bit and 32-bit operations result in serious side effects such as longer delay, more wiring, bypassing, and unwanted stalls in the pipeline. Table 1 shows the number of clock cycles needed for partial product reduction using a typical 8×8 bit multiplier circuit for performing 8-bit, 16-bit, and 32-bit multiplications.
TABLE 1
Operand Size    Necessary Number of Cycles (also called Cycle Latency)
8-bit           1 cycle
16-bit          2 cycles
32-bit          4 cycles
As discussed above, the prior art multi-cycle multiplier approach has numerous disadvantages, such as larger cycle latency, lower throughput, and, perhaps worst of all, different timing delays for different data formats, which create stalls in the pipeline when dealing with wider operands.
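The iterative method described above can be sketched in software as follows. This is a minimal illustration, not a model of any particular prior art circuit: the names `mul8x8` and `iterative_multiply` are hypothetical, operands are assumed unsigned, and the returned count is the number of 8×8 partial products, which is not the same thing as the cycle latency in Table 1 (the text does not say how many partial products the prior art hardware handles per cycle).

```python
from typing import Tuple


def mul8x8(a: int, b: int) -> int:
    """Stand-in for the small hardware multiplier: unsigned 8-bit by 8-bit."""
    assert 0 <= a < 256 and 0 <= b < 256
    return a * b


def iterative_multiply(a: int, b: int, width: int) -> Tuple[int, int]:
    """Multiply two unsigned `width`-bit values using only mul8x8.

    Returns (product, number_of_8x8_partial_products).
    """
    assert width % 8 == 0
    assert 0 <= a < (1 << width) and 0 <= b < (1 << width)
    num_bytes = width // 8
    accumulator = 0
    partial_products = 0
    for i in range(num_bytes):
        a_byte = (a >> (8 * i)) & 0xFF
        for j in range(num_bytes):
            b_byte = (b >> (8 * j)) & 0xFF
            # Each 8x8 product is shifted to its weight and accumulated.
            accumulator += mul8x8(a_byte, b_byte) << (8 * (i + j))
            partial_products += 1
    return accumulator, partial_products


if __name__ == "__main__":
    for width in (8, 16, 32):
        top = (1 << width) - 1
        product, count = iterative_multiply(top, top, width)
        print(width, hex(product), count)
```

The amount of work grows with the square of the operand width, which is why wider operands take more passes through the small multiplier and why the latency differs between data formats.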
Recently, Hideyuki et al. proposed a matrix vector multiplier (MVM) dedicated to video decoding and 3-D computer graphics in a reference entitled, “Matrix Vector Multiplier (MVM) Dedicated to Video Decoding and 3-D Computer Graphics,” IEEE Transactions on Circuits and Systems for Video Technology, Volume 9, No. 2, March 1999, pages 306-314. This multiplier supports multiple operations on 16-bit and 32-bit unsigned operands using only one multiplier, at the cost of a very low clock speed of 20 MHz. Like other multipliers using the iterative method, it requires many extra cycles to perform 32-bit multiply operations, which reduces the overall performance of the device. It would be advantageous to provide a multiplier circuit design that could support a variety of data formats (e.g., lengths) without consuming extra cycles for multiply operations on larger operands.
An Intel design is described in a reference entitled, “A 600 MHz IA-32 Microprocessor with Enhanced Data Streaming for Graphics and Video,” by Stephen Fischer, Digest of Technical Papers, ISSCC 1999, pages 98-450. In this design approach, two separate hardware multipliers are used to perform two 16×16 bit multiplications. Since these multipliers are not partitioned, this approach does not allow the flexibility to use them for a variety of data formats, and the duplication of circuitry consumes large amounts of area and power. Moreover, extra cycles are required to perform 32-bit operations because the iterative method is required for operands larger than 16 bits. Lastly, this design does not allow much parallelism for 8-bit operations.
The second prior art method for performing multiplication that supports a variety of data formats uses separate hardware for different data types. For instance, a separate 32×32 bit multiplier circuit, a separate 16×16 bit multiplier circuit and a separate 8×8 bit multiplier circuit are included within a single multiplier device. However, using separate hardware for different data types can become extremely costly because it requires large amounts of chip area and consumes more power.
An AltiVec design by Motorola is described in a paper entitled, “A Low Power, High Speed Implementation of a PowerPC Microprocessor Vector Extension,” by Martin S. Schmookler et al., presented at the 14th IEEE Symposium on Computer Arithmetic, 1999. This is the first architecture which supports multiplication on 8-bit and 16-bit signed and unsigned operands. However, like the Intel design described above, this prior art design uses redundant/separate hardware for performing 8-bit and 16-bit multiplications. It would be advantageous to provide a multiplier circuit design that could support a variety of data formats without consuming large amounts of area and power.
SUMMARY OF THE INVENTION
Accordingly, the present invention provides a multiplier design that accepts a large variety of data formats but does not require iterative steps (e.g., multi-cycling) to perform large operand multiplication, thereby providing very fast operational performance. The present invention advantageously provides constant cycle latency for 8-bit, 16-bit and 32-bit operand sizes and does not perform multiplier multi-cycling for larger operands. Further, the present invention provides a multiplier design that accepts a large variety of data formats but does not duplicate multiplier circuitry, thereby providing a hardware-efficient and energy-efficient device.
A partitioned multiplier circuit is described herein which is designed for high speed operations. The multiplier of the present invention can perform one 32×32 bit multiplication, two 16×16 bit multiplications (simultaneously) or four 8×8 bit multiplications (simultaneously) depending on input partitioning signals. Due to the design of the present invention, the time required to perform the 32×32 bit, the 16×16 bit or the 8×8 bit multiplications is the same. Multiplication results are available with a constant latency (e.g., two clock cycles in one embodiment) regardless of the operand bit-size. In the embodiment with a two clock cycle latency, the multiplier circuit has a throughput of one result per clock cycle due to pipelining. The input operands can be signed or unsigned. The hardware is partitioned without any significant increase in the delay or area and the multiplier can provide six different modes of operation. In one embodiment, Booth encoding is used for the generation of 17 partial products which are compressed using a compression tree into two 64-bit values. This is performed in the first pipeline stage to generate a 64-bit sum vector and a 64-bit carry vector. These values are then added, in the second pipestage, using a carry propagate adder circuit to provide a single 64-bit result. In the case of 16×16 bit multiplication, the 64-bit result contains two 32-bit results. In the case of 8×8 bit multiplication, the 64-bit result contains four 16-bit results. Due to its high operating speed, the multiplier circuit is advantageous for use in multimedia applications, such as audio/visual rendering and playback.
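The datapath summarized above can be illustrated with a small behavioral model. It is a sketch under stated assumptions and does not reproduce the patented circuit: radix-4 Booth recoding is assumed (it yields 17 digits for a 32-bit operand once unsigned values are covered), the compression tree is modeled as layers of 3:2 carry-save compressors, and the partitioned modes are modeled only functionally, with lane i of one operand paired with lane i of the other and results packed in lane order. All function names are hypothetical, and the lane pairing and packing order are assumptions not stated in the text.

```python
MASK64 = (1 << 64) - 1


def to_signed(bits, width):
    """Interpret a width-bit pattern as a two's-complement integer."""
    return bits - (1 << width) if (bits >> (width - 1)) & 1 else bits


def booth_radix4_digits(y_bits, signed):
    """Recode a 32-bit multiplier pattern into 17 radix-4 Booth digits in -2..2."""
    ext = y_bits
    if signed and (y_bits >> 31) & 1:
        ext |= 0b11 << 32  # sign-extend to 34 bits; unsigned is zero-extended
    digits, prev = [], 0
    for i in range(17):
        b0 = (ext >> (2 * i)) & 1
        b1 = (ext >> (2 * i + 1)) & 1
        digits.append(prev + b0 - 2 * b1)
        prev = b1
    return digits


def carry_save_reduce(vectors):
    """Compress 64-bit vectors to a (sum, carry) pair with layers of 3:2 compressors."""
    vs = list(vectors)
    while len(vs) > 2:
        full = len(vs) - len(vs) % 3
        nxt = []
        for k in range(0, full, 3):
            a, b, c = vs[k:k + 3]
            nxt.append(a ^ b ^ c)                                      # sum bits
            nxt.append((((a & b) | (a & c) | (b & c)) << 1) & MASK64)  # carry bits
        nxt.extend(vs[full:])  # leftover vectors pass to the next layer
        vs = nxt
    return vs[0], vs[1]


def multiply32(x_bits, y_bits, signed=False):
    """32x32 multiply: Booth partial products, carry-save stage, then final add."""
    x = to_signed(x_bits, 32) if signed else x_bits
    digits = booth_radix4_digits(y_bits, signed)
    partial = [((d * x) << (2 * i)) & MASK64 for i, d in enumerate(digits)]
    sum_vec, carry_vec = carry_save_reduce(partial)  # first pipeline stage
    return (sum_vec + carry_vec) & MASK64            # second stage: carry-propagate add


def partitioned_multiply(a_bits, b_bits, lane_bits, signed=False):
    """Functional model only: per-lane products packed into one 64-bit result."""
    assert lane_bits in (8, 16, 32)
    lane_mask = (1 << lane_bits) - 1
    out_mask = (1 << (2 * lane_bits)) - 1
    result = 0
    for lane in range(32 // lane_bits):
        x = (a_bits >> (lane * lane_bits)) & lane_mask
        y = (b_bits >> (lane * lane_bits)) & lane_mask
        if signed:
            x, y = to_signed(x, lane_bits), to_signed(y, lane_bits)
        result |= ((x * y) & out_mask) << (lane * 2 * lane_bits)
    return result


if __name__ == "__main__":
    print(hex(multiply32(0xFFFFFFFF, 0xFFFFFFFF)))
    print(hex(partitioned_multiply(0x00030002, 0x00050004, 16)))
    print(hex(partitioned_multiply(0x01020304, 0x05060708, 8)))
```

In this model the 32-bit path always produces 17 partial products and the same two-stage sequence regardless of operand values, which mirrors the constant-latency property claimed in the text. How the actual circuit suppresses cross-lane terms in the 16-bit and 8-bit modes is not described in this section, so the partitioned function computes each lane directly.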
Chehrazi Farzad
Farooqui Aamir A.
Oklobdzija Vojin G.
Malzahn David H.
Sony Corporation of Japan
Wagner , Murabito & Hao LLP
High performance universal multiplier circuit