Fast parallel multiplier implemented with improved tree...

Electrical computers: arithmetic processing and calculating – Electrical digital calculating computer – Particular function performed

Reexamination Certificate

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

Reexamination Certificate

active

06490608

ABSTRACT:

FIELD OF THE INVENTION
The present invention relates generally to the field of hardware used for performing computer arithmetic operations. More specifically, the present invention pertains to circuits capable of performing arithmetic operations on operands of signed and non-signed binary values.
BACKGROUND OF THE INVENTION
Hardware circuits for performing arithmetic operations are indispensable in every computer system, cellular phone and most digital audio/video equipment. In real-time applications (e.g., simulations, speech recognition, video teleconferencing, computer games, streaming audio/video etc.), the overall system performance is heavily dependent on the speed of the multipliers. For instance, processing digital images at 30 frames/second requires nearly 2.2 million multiplication operations per second. These systems require multiplication operations be performed at a very high speed.
A technique that is commonly used in high-speed multiplier is carry-save addition. In carry-save addition, a carry-save adder (CSA) accepts three n-bit operands and generates two n-bit results which include an n-bit partial sum, and an n-bit carry. A second CSA accepts these two n-bit results and another n-bit input operand, and generates a new partial sum and carry. A CSA is therefore, capable of reducing the number of operands to be added from 3 to 2 without any carry propagation. When the number of operands is eventually reduced to 2, the operands can be added together with a carry propagate adder (CPA). Carry save addition has the advantage of minimizing delay caused by carry propagation.
FIG. 1
illustrates a parallel 5-bit-by-5-bit multiplier
100
, which is comprised of CSAs
110
,
120
and
130
and CPA
140
. CSAs are made up of full adders (FA) and half adders (HA). As shown in
FIG. 1
, each multiplicand multiple is denoted by MP
i
j
which is obtained as the logical-AND of bit i of the multiplier and bit j of the multiplicand. The first-level CSA
110
is used for the first three multiplicand multiples, and at the next level, CSA
120
receives the outputs of the first level CSA
110
and adds to them the fourth multiplicand multiples. The outputs of the second level CSA
120
are provided to the third level CSA
130
adds the fifth multiplicand multiples. The outputs of the third level CSA
130
are two numbers which are then summed together by CPA
140
to generate the final product. This type of parallel multiplier is also commonly referred to as parallel-array multipliers:
Another class of parallel multiplier, proposed by C.S. Wallace in 1963, and commonly referred to as Wallace tree multipliers or simultaneous multipliers, are also commonly used. A block diagram of an exemplary six-operand Wallace tree multiplier
200
is illustrated in FIG.
2
. As shown in
FIG. 2
, Wallace tree multiplier
200
includes CSAs
210
,
220
,
230
and
240
and CPA
250
. The first level CSAs
210
and
220
receive the six operands and reduce them two four outputs. The second level CSA
230
receives the shifted carry output and sum output of the CSA
220
and the sum output of CSA
210
and generates two outputs, which are added to the shifted carry output of CSA
210
to generate two numbers. The two numbers which are added together by CPA
250
to generate the final product. In general, a Wallace tree multiplier has fewer levels (or stages) of carry-save adders than a straight forward parallel-array multiplier of the same width, and thus, better performance. A discussion of the parallel array multipliers and Wallace tree multipliers can be found in A. R. Omondi, “Computer Arithmetic Systems: Algorithms, Architecture and Implementation,” pp. 160-170, 1994.
A modification of the Wallace tree multiplier, proposed by L. Dadda in 1965 and generally called a Dadda multiplier, has less logic. There also exist a number of parallel multipliers in which other counters are used to reduce the depth of the tree. This, however, does not necessarily result in proportionate improvement in performance because the increase in the complexity of the counters also means an increase delay through them.
As the power of computer systems, cellular phones, digital A/V equipment etc. continues to increase, the demand for faster multipliers is likely to rise. At the same time, systems are also becoming increasingly integrated. Designing fast multipliers that occupy smaller areas on the integrated circuit (IC) chip is critical to a commercially successful product. Therefore, what is needed is a multiplier circuit that is faster and smaller. What is further needed is a method of reducing reduction trees such that the resulting parallel multiplier circuit uses fewer levels of CSA components than conventional multipliers.
SUMMARY OF THE DISCLOSURE
Accordingly, the present invention provides parallel multipliers that require one fewer stage of reduction than conventional multipliers. Accordingly, fewer adder components are required. The speed of the parallel multipliers of the present invention is also improved due to the fewer number of reduction stages.
The present invention also provides a method of constructing reduction trees in parallel multipliers to achieve fewer reduction stages. In one embodiment, the method of the present invention is applicable to signed multiplication and includes a step of generating an input matrix based on the operand sizes, data types, etc., of the multiplier; and determining a maximum number of nodes per column (K) of the input matrix. The method of the present embodiment also includes the step of generating a series of maximum stage capacity numbers &dgr;
N
including &dgr;
L-2
, &dgr;
L-1
, and &dgr;
L
where &dgr;
L-1
≦K≦&dgr;
L
. The method of the present embodiment then performs a step of trial reduction on an input matrix. The trial reduction step is performed in an attempt to reduce the maximum number of nodes per column of the input matrix to less than or equal to &dgr;
L-2
. If the trial reduction step is successful in reducing the input matrix, then reduction steps are performed to generate a reduction tree that has L-
1
reduction stages. Otherwise, reduction steps are performed to generate a reduction tree that has L stages.
In comparison to conventional reduction methods, such as those proposed by Wallace and Dadda, the present invention generates reduction trees that may use one fewer level of reduction stages for the same operands. Multipliers having fewer reduction stages would therefore have a smaller delay and may also require a smaller die area.
Further, in one embodiment, the method of the present invention includes a step of “squeezing” as many LSBs (least significant bits) of the output matrix of the reduction tree to at most one node. According to the present invention, the single nodes at the output matrix can be directly provided to the output of the multiplier without going through the final adder. Thus, a smaller final adder would be needed. Benefits of the “squeezing” step include (1) speeding up the circuit and (2) reducing the size of the circuit. In furtherance of one embodiment of the invention, the bit adjacent to the “squeezed” LSBs may have up to three nodes. Significant savings in delay and die area can be achieved because the third node may be provided to a carry input of the final adder.
Embodiments of the present invention include the above and further include a computer-readable medium having contained therein computer-readable codes for causing a computer-implemented logic synthesis system to perform a method of constructing a parallel multiplier. The method of the present embodiment includes the steps of: (a) accessing an input matrix; (b) determining a depth K of the input matrix; (c) calculating maximum stage capacity numbers for the input matrix. In the present embodiment, the maximum stage capacity numbers includes {&dgr;
N
} where &dgr;
N
=[&dgr;
0
, . . . &dgr;
L-2
, &dgr;
L-1
, &dgr;
L
]. In accordance with the present embodiment, &dgr;
L-1
denotes a maximum stage capacity at a stage (L-
1
) and

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

Fast parallel multiplier implemented with improved tree... does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with Fast parallel multiplier implemented with improved tree..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Fast parallel multiplier implemented with improved tree... will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-2982275

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.