Modulo address generation method and apparatus

Electrical computers and digital processing systems: memory – Address formation – Generating a particular pattern/sequence of addresses

Reexamination Certificate

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

C711S110000, C711S210000, C711S218000

Reexamination Certificate

active

06584556

ABSTRACT:

FIELD OF THE INVENTION
The present invention relates generally to data address generation in a digital data processing system, and, in particular, to a data address generator which generates modulo addresses for addressing data operands stored in a circular buffer.
BACKGROUND OF THE INVENTION
Digital processing of analog signals is critical to many important commercial applications, including such diverse fields as telecommunication networks, audio and video presentation devices, and computer controlled systems. Such applications typically utilize classic time-invariant algorithms, such as digital filtering and Fourier transforms. Although differing in their implementation details, these algorithms share a common characteristic: dependence upon a basic mathematical operation—the multiply and accumulate (“MAC”). In a “MAC operation”, a first data operand is multiplied by a second data operand, and the product is added to the current contents of an “accumulator”. In most such applications, the speed with which a MAC operation is performed is considered critical.
If the data operands are themselves simply elements of data operand “vectors”, as is often the case, each MAC operation requires pre-loading of an appropriate pair of operands using respective access address “pointers” into the data vectors, and then post-modification of each of the pointers according to a specific address access pattern. Typically, the access patterns are different for each of the data vectors. In some applications, one (or both) of the data vectors may be too large to fit into available system memory at one time, thus requiring further overhead to move each over-sized vector through a conveniently sized “buffer” which is allocated in either system or local memory. In general, each buffer is specified in terms of a starting “base address” and a “modulo” length, and the operands in that buffer are accessed according to an access pattern having a particular step “offset” size. In many algorithms, at least one of the buffers is accessed in a modulo manner, wherein a pointer that steps beyond the end of the buffer is wrapped, modulo the length of the buffer, back into the buffer. For the purpose of the description that follows, I will use the term “circular buffer” to refer to any memory-based data buffer which is accessed in such a modulo manner, regardless of whether or not the size of the buffer is less than or equal to the size of the data vector which may be stored therein.
In general, it is the presence of an execution unit (“EU”) especially designed to efficiently perform an atomic MAC operation that distinguishes a digital signal processor (“DSP”) from a general purpose digital data processor. In view of the importance of timely supplying the MAC EU with operands, many DSP's incorporate a pair of special purpose data address generators (“DAGs”) to assist the load/store unit (“LSU”) in supplying operands to the MAC EU. In such DSP's, a single atomic “MAC instruction” may be provided to allow a programmer to specify both the details of the MAC operation and, via special purpose registers, the characteristics of each of the operand access patterns.
It has occurred to me that application of conventional microprocessor design concepts to DSPs should prove beneficial for numerous reasons. First, the majority of DSP algorithms involve loops. Second, DSP algorithms tend to be computationally intensive. Third, DSP application code is usually relatively small, with relatively few conditional branches, thus reducing the control logic required for branch prediction. Fourth, many modern DSPs have dedicated hardware for loop operations. Finally, the results of such operations are often only interim results which are consumed within the loop and never used again, thus reducing register pressure and traffic through the LSU.
For the purpose of making relative performance comparisons in the description that follows, I shall estimate circuit performance in terms of “units of delay”, wherein I define One (1) unit of delay as the time required for an input signal to traverse a typical 3-input NAND gate and settle to the correct output logic level at the input of the next level of logic. Using a state of the art 0.18 micron manufacturing process, One (1) delay unit is approximately One Hundred (100) picoseconds. I will assume that such a typical gate would be implemented as a single, contiguous physical unit or cell of minimal sized transistors with minimum inter-transistor wiring. In all estimates that I shall make herein, I will also assume that, within each discrete functional unit, such as an adder, all requisite gates comprise a single, contiguous physical unit or super-cell so as to minimize inter-gate wiring.
In modern DSP's, the longest stage of the processing “pipeline” is the single-cycle MAC EU. Using current state of the art logic design, the critical speed path through a MAC EU is approximately Forty (40) delay units. Thus, the maximum clock rate for such a design would be on the order of Two Hundred Fifty (250) MHz. In contrast, the critical speed path through a current state of the art DAG is approximately Twenty (20) delay units. Since the DAG is already twice as fast as it needs to be to keep up with the MAC EU, there has been little incentive to improve its performance, particularly since such improvement would come only at the cost of additional hardware, power consumption, waste heat, etc.
In the field of general purpose digital data processors, it has been demonstrated that considerable improvement in performance can be achieved by employing a very deep pipeline, on the order of Twelve (12) stages or more, and increasing the clock rate accordingly. In high performance processors, careful attention is given to partitioning the pipeline so as to balance the relative speed paths through each stage. A significant imbalance may indicate the desirability of splitting that stage into multiple stages or of augmenting that stage with additional hardware resources. In either case, the consequences on relative cost to performance must be considered.
In a modern deeply pipelined microprocessor, such as the “Alpha” (originally designed by engineers working for the Digital Equipment Company), the theoretical clock-cycle-limiting pipe stage is considered to consist of an input latch, a minimum arithmetic logic unit (“ALU”) operation, and result forwarding back to the input latch, requiring about Eleven (11) delay units using current state of the art design techniques. Such a design allows single-cycle ALU forwarding, while achieving high clock frequency rates. It is also close to the minimum time required to drive and sample a state of the art memory array, such as a 64×64 static random access memory (SRAM) array. If such design techniques could be effectively applied to the MAC in a DSP, one might expect to realize commensurate improvement in system performance. However, just deeply-pipelining the MAC is not sufficient to achieve the desired 11-delay-unit clock cycle: the clock-cycle-limiting stage is now the DAG!
FIG. 1
illustrates a prior art data address generator (DAG
2
) adapted for use in a DSP processor (not shown) having at least One (1), memory resident, data operand buffer (not shown), the location and size of which are specified by a base address (“B”) and a length (“L”), stored in respective registers (not shown). The single-stage DAG
2
is constructed to generate, each clock cycle, an index pointer (“I”) to the next operand in the buffer as a function of B, L, and an offset (“M”). In operation, the index pointer, I, steps through the buffer in increments of M. When I steps beyond the end of the buffer, i.e. where I is greater than (B+L), L is subtracted from I so that I wraps back, modulo L, to a valid address inside the buffer. Such a modulo address generation method can be described by the following algorithm, illustrated in the form of pseudocode:
for (a=0; a<LoopCount; a++)
{
if ((I+M)<(B+L))
I
a+1
=(I
a
+M);

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

Modulo address generation method and apparatus does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with Modulo address generation method and apparatus, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Modulo address generation method and apparatus will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-3138383

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.