Electrical computers: arithmetic processing and calculating – Electrical digital calculating computer – Particular function performed
Reexamination Certificate
1998-02-24
2001-03-13
Malzahn, David H. (Department: 2787)
Electrical computers: arithmetic processing and calculating
Electrical digital calculating computer
Particular function performed
C708S491000, C708S518000, C708S552000, C708S606000, C708S625000
Reexamination Certificate
active
06202077
ABSTRACT:
FIELD OF THE INVENTION
The present invention generally relates to vector processing systems, and more specifically to a set of extended precision operand formats for efficiently performing arithmetic operations in a single-instruction/multiple-data (SIMD) data processing system.
BACKGROUND OF THE INVENTION
In DSP architectures, it is common to have a multiply-accumulate instruction which can accumulate multiple products without loss of accuracy. For example, if the DSP has 16-bit input operands, the product will usually be 32-bits, and the DSP accumulator may hold 40 or 48-bits so that a number of products can be summed without having to worry about overflow or carry out of the high-order accumulator bit position.
Accumulators long enough to accumulate a number of products are much less common in general purpose computer architectures. One reason for this is that most such general purpose computer architectures utilize multiple named registers instead of the single global accumulator.
In multi-precision or extended precision multiplication operations, it is often desirable to cumulate many products without a loss of accuracy. One typical operation is to multiply two 16-bit numbers to generate a single 32-bit number. Another typical operation is to multiply two 32-bit numbers and generate two 32-bit partial products: one instruction will generate the most significant 32-bits of the full product; and another instruction will generate the least significant 32-bits of the full product. In these cases, carries and overflows are potentially problematic since each addition can potentially result in a carry or an overflow. This situation is ameliorated by an add-extended operation, in which a carry-bit can be used and set in a single addition operation.
In Single-Input/Multiple-Data (SIMD) architectures such as MMX, and VMX, things are often much more complicated. For example, the MMX and VMX architectures have multiply-sum instructions which take two registers (V
1
, V
2
) each containing eight 16-bit elements and a register (V
3
) containing four 32-bit elements and returns a register (V
4
) which contains four 32-bit elements. Let VM:N represent each 16-bit (V
1
/V
2
) or 32-bit (V
3
/V
4
) element in the four registers, with “M” representing the register number, and “N” representing the element within the register, starting from the right (0). Such a multiply-sum instruction computes:
V
4:0
=
V
1:0
*
V
2:0
+
V
1:1
*
V
2:1
+
V
3:0
V
4:1
=
V
1:2
*
V
2:2
+
V
1:3
*
V
2:3
+
V
3:1
V
4:2
=
V
1:4
*
V
2:4
+
V
1:5
*
V
2:5
+
V
3:2
V
4:3
=
V
1:6
*
V
2:6
+
V
1:7
*
V
2:7
+
V
3:3
This type of multiply/accumulate instruction can be very useful in certain applications, but there is an inherent difficulty in using the instruction if the full 16-bits of V
1
and V
2
are used since the sum of products V
1
:
0
*V
2
:
0
+V
1
:
1
*V
2
:
1
+V
3
:
0
may overflow. Additionally, in multi-precision computations, a large number of products often need to be added or accumulated, and a SIMD sequence for doing this computation often requires three instructions: an add to produce a carry; an add modular for the result; and an add accumulating the carry. These three instructions cumulatively take three clock cycles on many architectures.
There are many reasons to perform multiple-precision arithmetic. One extremely important use for this type of arithmetic these days is in cryptography, such as in performing modular exponentials such as are required in RSA. In computing these modular exponentials, there is a need for many computations of the form A*B mod C, where C is fixed, and many different A and B are utilized. In common scenarios, A, B, and C are either 512-bit or 1024-bit numbers.
FIG. 1
is a diagram illustrating extended precision multiplication. An extended precision multiplier
12
consists of four 8-element groups. If each element is a 16-bit quantity, then each group is a 128-bit extended word, and the multiplier
12
is a 512-bit operand. The multiplier
12
is multiplied by one eight-element group representing a 128-bit multiplicand
14
. Each element in the multiplier
12
and in the multiplicand
14
is labeled by an upper number indicating the group and a lower number representing the element within the group. The figure illustrates which elements are multiplied together to generate portions of the ultimate product
18
. Sixteen sets of partial products
16
are shown with each partial product aligned according to its contribution to the entire product. Each partial product is labeled by a top row indicating the group number and element number of an element in the multiplier
12
and a bottom row indicating the group number and element number of an element in the multiplicand
14
. The top, set of partial products, consists of the even numbered elements in the multiplier
12
multiplied by the first, low order element, of the multiplicand 14. Starting from the right, a least significant 32-bit partial product consists of the product of the multiplier
12
least significant element (00) multiplied by multiplicand
14
element (00). The next most significant 32-bit partial product consists of a 32-bit product of the second low order (02) multiplier
12
element multiplied by the low order multiplicand
14
element (00). This extends up through the high order multiplier
12
element (36) multiplied by the low order multiplicand
14
element (00). The next line down contains the odd-numbered elements from the multiplier
12
multiplied by the low order (00) multiplicand
14
offset by 16 bits to the left. Note should be made that this lines up with the second element in either the multiplicand
14
or the multiplier
12
. The rightmost 32-bit is the 32-bit product of the (01) element of the multiplier
12
and the (00) element of the multiplicand
14
. This extends up through the high order (37, 00) 32-bit product. Following this line, and equally indented, is the set of partial products formed by multiplying the even-numbered elements of the multiplier
12
by the second (01) multiplicand
14
element. This continues, with each pair of partial products being offset by 16-bits to the left over that of the previous lower order terms, until the final partial product which consists of products of the odd-numbered multiplier
12
elements multiplied by the high order (07) multiplicand
14
elements. The 16 partial products are then summed to generate the product
18
. Since the multiplier
12
consists of eight 128-bit groups and the multiplicand
14
consists of one 128-bit group, the product
18
consists of five 128-bit groups. It can be seen, that the offset is provided in order to line up corresponding bits and elements for the summation.
REFERENCES:
patent: 5457804 (1995-10-01), Ohtomo
patent: 5784305 (1998-07-01), Nadehara
patent: 5809292 (1998-09-01), Wilkinson et al.
patent: 5859789 (1999-01-01), Sidwell
patent: 5862067 (1999-01-01), Mennemeier et al.
patent: 5880483 (1999-03-01), Elliott et al.
patent: 5880984 (1999-03-01), Burchfiel et al.
patent: 5941941 (1999-08-01), Hasegawa
Montgomery, “Modular Multiplication Without Trial Division”, Mathematics of Computation, vol. 44, No. 170, pp. 519-521 (1985).
Malzahn David H.
Motorola Inc.
LandOfFree
SIMD data processing extended precision arithmetic operand... does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with SIMD data processing extended precision arithmetic operand..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and SIMD data processing extended precision arithmetic operand... will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-2519541