Reexamination Certificate
2001-07-31
2004-11-02
Ngo, Chuong Dinh (Department: 2124)
Electrical computers: arithmetic processing and calculating
Electrical digital calculating computer
Particular function performed
C708S518000
active
06813627
ABSTRACT:
FIELD OF THE INVENTION
The present invention relates to performing integer multiply operations in computer systems. More specifically, the present invention relates to performing integer multiply operations using instructions that perform several smaller multiply operations in parallel.
DESCRIPTION OF THE RELATED ART
In the art of computing, central processing units (CPUs) perform tasks by executing instructions that are part of an instruction set. Some of these instructions are dedicated to performing basic mathematical operations, including integer multiply operations.
The operations performed by instructions are implemented by logic gates on an integrated circuit (IC) die. The logic gates required to implement some operations, such as integer addition operations, tend to consume a relatively small area of the die. On the other hand, the logic gates required to implement integer multiply operations tend to consume a significantly larger area of the die. Accordingly, it is important to optimize the design of the circuits that perform integer multiply operations to minimize the die area consumed by these circuits.
CPUs typically have two types of functional units for performing mathematical operations. The first type of functional unit is the integer unit, which is responsible for performing integer (or alternatively, fixed-point) mathematical operations. The second type of functional unit is the floating-point unit, which is responsible for performing floating-point operations. The two functional units typically reside on distinct areas of the die, and each functional unit typically has access to its own register file. Separating the two functional units allows each unit to be optimized to perform the functions it supports.
Furthermore, there is typically little interaction between the integer and floating-point units, so there is little penalty incurred by separating the units.
Historically, integer multiplication has been considered important enough, from a performance perspective, to provide instructions in the instruction set that explicitly support integer multiply operations. However, integer multiplication has traditionally not been considered important enough to provide a full implementation of a 32-bit or 64-bit integer multiplier in the integer unit, especially in reduced instruction set computer (RISC) CPUs. As discussed above, such an integer multiplier unit consumes a large area on the die, and this die area can typically be better used to provide other functions.
One prior art technique for supporting integer multiply instructions is to provide a smaller integer multiplier (such as an 8-bit or 16-bit multiplier) in the integer unit. The smaller multiplier computes sums of smaller products to produce a 32-bit or 64-bit result, and uses multiple cycles to compute the result. This approach has the advantage of consuming a relatively small area on the die. However, the smaller multiplier is nonetheless only useful for performing integer multiply operations, and is relatively slow. One CPU that uses this approach is the MIPS® R3000® RISC processor, which is a product of MIPS Technologies, Inc.
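The "sums of smaller products" idea can be sketched in software. The following is a minimal illustration only, assuming a 16-bit multiplier and unsigned 32-bit operands; the function names `mul16` and `mul32_from_16` are hypothetical and do not describe the circuit of any particular CPU.

```python
MASK16 = 0xFFFF

def mul16(a, b):
    """Stand-in for a small 16x16 -> 32-bit unsigned hardware multiplier (hypothetical)."""
    assert 0 <= a <= MASK16 and 0 <= b <= MASK16
    return a * b

def mul32_from_16(x, y):
    """Multiply two unsigned 32-bit values using only mul16, shifts, and adds."""
    x_lo, x_hi = x & MASK16, (x >> 16) & MASK16
    y_lo, y_hi = y & MASK16, (y >> 16) & MASK16
    # Four partial products; in hardware each would be one pass
    # through the small multiplier, hence the multi-cycle cost.
    p0 = mul16(x_lo, y_lo)
    p1 = mul16(x_lo, y_hi)
    p2 = mul16(x_hi, y_lo)
    p3 = mul16(x_hi, y_hi)
    # Weight each partial product by the position of its operand halves and sum.
    return p0 + ((p1 + p2) << 16) + (p3 << 32)
```

The four calls to `mul16` correspond to the repeated use of the small multiplier, which is why this approach saves die area at the cost of additional cycles.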
Another prior art technique is to use the floating-point unit to perform integer multiply operations. Typically this approach requires that a data path be provided between the integer register file and the floating-point register file. To perform an integer multiply operation, the operands are transferred from the integer register file to the floating-point register file via the data path, a multiplier in the floating-point unit is used to perform the integer multiply operation using operands from and storing the result to the floating-point register file, and the result is transferred from the floating-point register file back to the integer register file. This approach is used by CPUs adhering to the PA-RISC architecture, which are products of the Hewlett-Packard Company, and CPUs adhering to the IA-64 architecture, which are products of Intel Corporation. The IA-64 architecture was developed jointly by Hewlett-Packard Company and Intel Corporation.
This approach has the advantage of using existing multiplier circuits in the floating-point unit, so little extra area on the die is required. Furthermore, floating-point units typically include full multiplier implementations capable of performing 32-bit or 64-bit multiply operations in relatively few clock cycles. However, this approach also has several disadvantages. Since the integer and floating-point units are designed independently, each unit is optimized for its own operations and the data path between the two units is often not very fast. Another disadvantage is that floating-point registers, which could be used to perform other tasks, are needed for intermediate computation. Another disadvantage of using the floating-point unit is power. The floating-point unit typically uses a lot of power, and if a program does no real floating point work, many modern processors power down the floating-point unit. Thus, powering the floating-point unit up for an occasional integer multiply operation consumes significant power.
Code Segment A illustrates how an integer multiply operation is typically performed in a CPU adhering to the IA-64 architecture. In Code Segment A, the integers to be multiplied are stored in registers r32 and r33, and the result is placed in r34.
Code Segment A
1: setf.sig f6 = r32
2: setf.sig f7 = r33
3: xmpy.l f6 = f6, f7
4: getf.sig r34 = f6
The instructions shown in Code Segment A are discussed in greater detail in the Intel® IA-64 Architecture Software Developer's Manual, Volume 3: Instruction Set Reference, Revision 1.1, which was published in July of 2000 and is hereby incorporated by reference. Furthermore, the latencies associated with these instructions on an Itanium™ CPU are discussed in the Itanium™ Processor Microarchitecture Reference for Software Optimization, which was published in August 2000 and is hereby incorporated by reference. The Itanium™ processor is the first CPU to adhere to the IA-64 architecture.
Returning to Code Segment A, at line 1 the instruction “setf.sig” is used to transfer the contents of general register 32 (r32) to the significand field of floating-point register 6 (f6). Similarly, at line 2 the contents of r33 are transferred to the significand field of f7. The “setf.sig” instructions of lines 1 and 2 can be issued during the same clock cycle, and have a latency of nine cycles. Accordingly, if the “xmpy.l” instruction of line 3 is scheduled closer than nine cycles from the “setf.sig” instructions, the pipeline will delay execution of the “xmpy.l” instruction until nine cycles have elapsed.
At line 3, the “xmpy.l” instruction treats the contents of the significand fields of f6 and f7 as signed integers, and multiplies the contents together to produce a full 128-bit signed result, with the least significant 64 bits of the result being stored in the significand field of f6. The “xmpy.l” instruction has a latency of eight cycles, so if the “getf.sig” instruction of line 4 is scheduled closer than eight cycles from the “xmpy.l” instruction, the pipeline will delay execution of the “getf.sig” instruction until eight cycles have elapsed.
Finally, the “getf.sig” instruction of line 4 transfers the significand field of f6 to r34. The “getf.sig” instruction has a latency of two cycles, after which the result of the multiply operation is available in r34.
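The data flow of Code Segment A can be modeled in a few lines of Python. This is a behavioral sketch only, assuming that the significand field is treated as a plain 64-bit container and that registers are represented as Python integers; the helper names `to_signed64` and `multiply_via_fp_unit` are hypothetical and are not part of the IA-64 architecture.

```python
MASK64 = (1 << 64) - 1

def to_signed64(bits):
    """Interpret a 64-bit pattern as a two's-complement signed integer."""
    bits &= MASK64
    return bits - (1 << 64) if bits >> 63 else bits

def multiply_via_fp_unit(r32, r33):
    """Model the four steps of Code Segment A on 64-bit register patterns."""
    f6 = r32 & MASK64                             # line 1: copy r32 into the significand of f6
    f7 = r33 & MASK64                             # line 2: copy r33 into the significand of f7
    product = to_signed64(f6) * to_signed64(f7)   # line 3: full signed product (up to 128 bits)
    f6 = product & MASK64                         # line 3: keep only the least significant 64 bits
    r34 = f6                                      # line 4: copy the significand of f6 back to r34
    return r34
```

For example, `multiply_via_fp_unit(3, 5)` yields 15, and passing the 64-bit pattern for -3 yields the 64-bit pattern for -15, which `to_signed64` converts back to a signed value.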
Note that the integer multiply operation shown in Code Segment A has a total latency of 19 cycles, which is relatively slow. Although the integer multiply operation has a relatively long latency, many multiply operations can be pending in the pipeline, thereby allowing a multiplication result to be generated every few cycles.
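The 19-cycle figure is the sum of the per-instruction latencies quoted above (nine, eight, and two cycles). The sketch below reproduces that arithmetic and illustrates why pipelining hides the latency for independent multiplies. The `issue_interval` parameter is an assumption introduced purely for illustration; the text says only that a result can be generated "every few cycles".

```python
# Latencies in cycles, as stated in the discussion of Code Segment A.
LATENCY = {
    "lines 1-2: setf.sig": 9,   # both transfers can issue in the same cycle
    "line 3: multiply": 8,
    "line 4: getf.sig": 2,
}

def dependent_chain_latency():
    """Total latency of one multiply, since each step waits for the previous one."""
    return sum(LATENCY.values())

def completion_cycles(n, issue_interval=2):
    """Cycle at which each of n independent multiplies completes, assuming
    (hypothetically) that a new multiply sequence starts every issue_interval cycles."""
    total = dependent_chain_latency()
    return [i * issue_interval + total for i in range(n)]
```

Under these assumptions, the first result still arrives after 19 cycles, but each later result follows only `issue_interval` cycles after the one before it.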
This latency is not an issue for applications that perform many integer multiply operations in a sequence. In such applications, modulo scheduling allows the pipeline to be loaded with many multiply operations, thereby hi
Hull James M.
Morris Dale C.
Ngo Chuong Dinh
Plettner David A.