Electrical computers: arithmetic processing and calculating – Electrical digital calculating computer – Particular function performed
Reexamination Certificate
1999-07-07
2003-09-16
Mai, Tan V. (Department: 2124)
active
06622153
ABSTRACT:
FIELD OF THE INVENTION
The present invention is directed to a multiplier-accumulator (MAC) and, more particularly, to a virtual parallel multiplier-accumulator (VMAC) that processes more than or less than one MAC operation within a single system clock cycle.
BACKGROUND OF THE INVENTION
A multiply-accumulate (MAC) operation is a common operation performed in signal processing and other algorithms. Because of its frequency of occurrence in such algorithms, many prior art microprocessors and digital signal processors (DSPs) include some form of direct instruction support for the multiply-accumulate operation. Typically, the CPU's instruction set includes a multiply-accumulate instruction, or multiply and add instructions that, together, can execute a MAC operation in a single system clock cycle. These instructions are executed by hardware circuits such as separate multiplier and adder circuits, or a combined multiply-add circuit.
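For illustration only, a MAC operation can be sketched in software as a multiply followed by an add into a running total. The function names `mac` and `dot_product` below are hypothetical and are not taken from the specification; the sketch ignores bit widths and timing.

```python
def mac(acc, a, b):
    """One multiply-accumulate step: add the product a * b to acc."""
    return acc + a * b


def dot_product(xs, ys):
    """Sum of pairwise products, computed as a chain of MAC steps."""
    acc = 0
    for x, y in zip(xs, ys):
        acc = mac(acc, x, y)
    return acc
```

A dot product of two sample vectors, as found in a filter or correlation loop, is a typical use of such a chain.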
Algorithms that use MAC operations typically consist of a loop over many iterations. The algorithm's performance can be improved by executing the MAC operations of multiple loop iterations at once. This property has motivated CPU designers to include instructions that execute multiple MAC operations per system clock cycle. An instruction executing multiple MAC operations per system clock cycle may be implemented in a number of ways. For example, hardware may be provided to execute multiple MAC operations per cycle consisting of a number of multipliers and adders or a number of multiply-add circuits. By providing multiple arithmetic circuits, the CPU can execute the simultaneous multiplies and adds needed to support multiple MAC operations in parallel.
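As a rough software analogy, assuming two independent accumulators and input sequences of equal length, the following sketch performs the MAC operations of two loop iterations per pass. The name `dot_product_dual` is hypothetical; in hardware the two products of each pass would be formed by separate arithmetic circuits within the same cycle.

```python
def dot_product_dual(xs, ys):
    """Sum of pairwise products using two accumulators, two MACs per pass."""
    acc0 = 0  # accumulates the even-indexed products
    acc1 = 0  # accumulates the odd-indexed products
    n = len(xs)
    for i in range(0, n - 1, 2):
        acc0 += xs[i] * ys[i]
        acc1 += xs[i + 1] * ys[i + 1]
    if n % 2:
        # Odd length: one product is left over after the paired passes.
        acc0 += xs[-1] * ys[-1]
    # The two partial sums must still be combined by one final addition.
    return acc0 + acc1
```

The final addition of the two partial sums is an extra step that a single-accumulator loop does not need.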
Microprocessor integrated circuits may include a plurality of multiplier-accumulator (MAC) units connected in parallel with each other. While this configuration provides the ability to perform multiple MAC operations within a single system clock cycle, it also consumes more real estate within the integrated circuit, and adversely affects the performance and power consumption of the integrated circuit due to the relatively long bus connections between multi-port memories, registers, and the multiple MAC units.
An example of a prior art CPU data path executing two MAC operations per cycle is depicted in FIG. 1. Each MAC unit defines a data path which consists of a register file comprised of sixteen, 40-bit registers, each having a multiplier and a load/store/arithmetic unit attached thereto. The multipliers each multiply two 16-bit operands to produce a 32-bit product. The multipliers can accept a new operand and produce a new product every system clock cycle, but have a latency of two system clock cycles. The load/store/arithmetic units can perform a 40-bit accumulate (i.e., addition/subtraction) in a single system clock cycle. The multiple MAC units are identical to each other, and provide an effective throughput of two multiply-accumulates per system clock cycle. Performing a complete multiply-accumulate operation requires passing the operands through a multiplier by issuing a multiply instruction, and then through a load/store/arithmetic unit by issuing an add instruction. The multiply and add instructions are scheduled for execution so that the product of the multiply operation is not used by the add operation until the multiplier has finished generating the product.
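The operand and accumulator widths described above can be modeled with the following minimal sketch. It assumes signed two's-complement operands and a 40-bit accumulator that wraps on overflow; the specification as excerpted here does not state signedness or overflow behavior, so both are assumptions, and the helper names are hypothetical.

```python
def to_signed(value, bits):
    """Wrap an integer into a signed two's-complement value of the given width."""
    value &= (1 << bits) - 1
    if value >= 1 << (bits - 1):
        value -= 1 << bits
    return value


def multiply16(a, b):
    """Multiply two 16-bit signed operands; the result always fits in 32 bits."""
    return to_signed(a, 16) * to_signed(b, 16)


def accumulate40(acc, product):
    """Add a product to a 40-bit accumulator, wrapping on overflow (assumed)."""
    return to_signed(acc + product, 40)
```

Because the accumulator is 8 bits wider than the 32-bit product, many products can be summed before overflow becomes possible.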
A prior art dual MAC data path is depicted in FIG. 1, and a timing diagram for that MAC is depicted in FIG. 2. The timing diagram depicted in FIG. 2 represents the timing for one of the components of the data path of FIG. 1, with the timing diagram for the other component of the data path being substantially similar. In operation, the first two multiply operands are read from a register file (REG FILE A) during Cycle 1 on signal lines DI_M1S1 and DI_M1S2. The values of these first operands are determined by the data stored at the corresponding register addresses, e.g., register file A source 1 (REGS1A-1) and register file A source 2 (REGS2A-1). These first operands are communicated to multiplier M1, which begins a multiply operation on the two operands. In Cycle 2, a second set of operands is read from the register file (REGS1A-2 and REGS2A-2) and communicated to the multiplier M1, which begins a multiply operation. At the same time, the multiplier M1 finishes its multiply operation on the first operands and generates a first output product PROD1-1, which is output on signal line PS_M1D. The first output product is communicated to register file A at the end of Cycle 2. During Cycle 3, the first product that was generated, PROD1-1, is read from register file A on signal line PS_L1S1 and communicated to the load/store/arithmetic unit L1 as a first operand. The second operand to be accumulated by L1 is the value denoted ACC1-1 and is read from register file A on signal line PS_L1S2. The sum of the accumulation operation performed by L1 on PROD1-1 and ACC1-1, designated as SUM1-1, is written to register file A at the end of Cycle 3 over signal line RA_L1D. Also, during this cycle, a second product PROD1-2 is generated by the multiplier M1 and written to register file A. Similarly, third operands are read from register file A (REGS1A-3 and REGS2A-3) and communicated to the multiplier M1, which begins a multiply operation on the third operands. During Cycles 4, 5 and 6, successive products are accumulated by L1 and additional products are generated by M1. When finished, the two mirror components of the prior art MAC data path have each accumulated the sum of an independent sequence of products. If the sum of those two sequences is needed, an additional accumulation instruction is issued to add the two sums.
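The cycle-by-cycle behavior described above can be approximated by the following simplified software model of one component of the data path. It assumes that operands read in one cycle yield a product in the register file at the end of the next cycle, and that the product is accumulated in the cycle after that. The function names, and the alternating assignment of operand pairs to the two mirror components, are assumptions made for illustration and are not taken from the figures.

```python
def simulate_mac_path(operand_pairs, acc=0):
    """Model one multiplier/adder path; return (final sum, cycles used)."""
    pending = list(operand_pairs)  # operand pairs not yet read
    in_multiplier = None           # operands read during the previous cycle
    in_regfile = None              # product written at the end of the previous cycle
    cycle = 0
    while pending or in_multiplier is not None or in_regfile is not None:
        cycle += 1
        # Adder: accumulate the product that reached the register file.
        if in_regfile is not None:
            acc += in_regfile
        # Multiplier: finish the operands that were read one cycle earlier.
        if in_multiplier is not None:
            product = in_multiplier[0] * in_multiplier[1]
        else:
            product = None
        # Register file read: fetch the next operand pair, if any remain.
        next_operands = pending.pop(0) if pending else None
        in_regfile, in_multiplier = product, next_operands
    return acc, cycle


def simulate_dual_path(operand_pairs):
    """Model two mirror paths plus the final add; return (sum, cycles used)."""
    sum_a, cycles_a = simulate_mac_path(operand_pairs[0::2])
    sum_b, cycles_b = simulate_mac_path(operand_pairs[1::2])
    # One additional cycle is spent adding the two independent sums.
    return sum_a + sum_b, max(cycles_a, cycles_b) + 1
```

Under these assumptions a single path needs two cycles more than the number of products it accumulates, and the dual path needs one further cycle for the combining add.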
It is common in CPU designs to increase the CPU clock frequency by processing instruction execution in a pipeline. The flow of instructions and their operands and results through the pipeline is controlled by the CPU's pipeline control logic. For CPUs that do not support a MAC operation, the duration of a pipeline stage (and therefore the clock frequency) is typically determined by the adder circuit or the delay to access memory. For CPUs that support MAC operations, the duration of a pipeline stage is often determined by the multiplier/adder/multiply-add circuit, i.e. by the hardware provided to perform the MAC operation. To overcome this limitation, prior art CPUs extend the pipeline by pipelining the multiplier/adder/multiply-add arithmetic circuits. Although the arithmetic circuits are pipelined with a fixed number of stages, pipelining still introduces significant complexity both in the design of the pipeline control logic and in writing a sequence of instructions to handle the latency of the pipeline. Ideally, the MAC operation should be executed with an arithmetic circuit that does not constrain the CPU's clock frequency and does not introduce complex latencies for the programmer to manage.
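The relationship between stage duration and clock frequency can be sketched as follows, under the simplifying assumption that the clock period equals the delay of the slowest pipeline stage, with register overhead ignored. The function name is hypothetical, and any delay values passed to it are illustrative rather than taken from the specification.

```python
def max_clock_mhz(stage_delays_ns):
    """Upper bound on clock frequency in MHz, set by the slowest stage."""
    return 1000.0 / max(stage_delays_ns)
```

For example, with illustrative delays of 2.0 ns for an adder stage and 5.0 ns for an unpipelined multiplier stage, the bound is 200 MHz; splitting the multiplier into two 2.5 ns stages raises it to 400 MHz, at the cost of one more cycle of latency.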
The prior art dual MAC data path has a number of disadvantages. Firstly, two multipliers and two adders are required. Secondly, the clock frequency of the dual MAC data path is restricted by the multiplier's delay, even though the multiplier is already pipelined once in an attempt to reduce its impact on the system frequency. That pipelining in turn requires extra circuit area and power, and adds latency if the product is immediately re-used in a subsequent multiplication. Finally, the prior art dual MAC data path does not produce a single sum of all four products, and the data path has to be partitioned into mirror components to reduce the pressure on register file ports and bus loading. As a result, the data path does not directly sum a sequence of products in half the number of cycles, and an additional cycle is needed to add the final sums.
It is desirable to provide a MAC unit that overcomes the shortcomings of the prior art.
SUMMARY OF THE INVENTION
The present invention is directed to a virtual parallel m
Lee Hyun
Whalen Shaun P.
Agere Systems Inc.
Mai Tan V.
Stroock & Stroock & Lavan LLP