DSP architecture optimized for memory accesses
Reexamination Certificate (active)
Patent number: 06564309
Filed: 1999-04-06
Issued: 2003-05-13
Examiner: Kim, Matthew (Department: 2186)
Classification: Electrical computers and digital processing systems: memory – Storage accessing and control – Access timing
Other classes: C711S150000, C711S120000
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a digital signal processor (DSP), and more specifically to an architecture that avoids problems resulting from memory latency.
2. Discussion of the Related Art
FIG. 1 schematically and partially shows a conventional DSP architecture. The DSP includes four processing units operating in parallel. Two of these units are memory access units (MEMU) 10. An arithmetic and logic unit (ALU) 12 and a branch unit (BRU) 14 are further provided. Each of the MEMU units is associated with a memory 16 via an independent bus.
Branch unit BRU receives from an instruction memory, not shown, a compound instruction INST which can include four elementary instructions to be provided to each of the units in parallel. Unit BRU retrieves the instruction meant for it and distributes in parallel the three remaining instructions I1, I2, and I3 to the ALU and MEMU units.
Each of the ALU and MEMU units generally includes an instruction queue 18, in the form of a FIFO, in which the instructions wait before they are processed by the corresponding unit.
A DSP of the type of FIG. 1 is optimized to perform vectorial operations of the type X[i] OP Y[j], where i and j vary, generally in a loop, and where OP designates any operation to be performed by arithmetic unit 12. Indeed, operands X[i] and Y[j] can be fetched together via the two buses of memory 16 and processed, in theory, in the same cycle by ALU 12.
In practice, difficulties arise due to the structure of currently used memories, generally SRAMs. Although a memory access can be performed at each cycle, the reading of data from a conventional SRAM generally has a latency of two cycles. Indeed, upon execution of a read instruction, the address is presented to the memory. An additional cycle is required to provide the memory with a read access signal, and a last cycle is required for the memory to present the data over its data bus.
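In other words, an address presented at cycle n yields its data on the bus only at cycle n + 2. A minimal C sketch of this assumption (the function name is ours, not the patent's):

    /* Read latency model used in the discussion below: address at cycle n,
     * read access signal at cycle n + 1, data on the bus at cycle n + 2. */
    static int sram_data_available(int address_cycle, int current_cycle)
    {
        return current_cycle >= address_cycle + 2;
    }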
To illustrate the resulting difficulties, consider as an example a common loop whose function is to increment by a constant successive values stored in the memory. This loop may directly be written as:
    LD:  R1 = [i]
    OP:  R1 = R1 + R2
    ST:  [i] = R1
    BR:  test i, i++, loop                              (1)
This loop, for clarity, uses a single MEMU unit. It consists of loading (LD) into a register R1 the value stored in the memory at address i, incrementing (OP) the content of register R1 by the value contained in a register R2, storing (ST) at address i the new content of register R1, and finally incrementing and testing address i to resume the loop (BR). The loop is left when branch unit BRU detects that address i has reached a predetermined value. In a DSP, there are generally no BR-type instructions: the loops are programmed in advance by setting registers provided for this purpose in unit BRU, which performs the tests, increments, and branches independently.
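In a high-level language, loop (1) amounts to the following C sketch (the array name mem, the bound N, and the constant K are illustrative names, not taken from the patent):

    /* C equivalent of loop (1): increment by a constant each of N successive
     * values stored in memory (mem, N and K are illustrative names). */
    void increment_loop(int *mem, int N, int K)
    {
        for (int i = 0; i < N; i++) {   /* BR: test i, i++, loop */
            int r1 = mem[i];            /* LD: R1 = [i]          */
            r1 = r1 + K;                /* OP: R1 = R1 + R2      */
            mem[i] = r1;                /* ST: [i] = R1          */
        }
    }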
Register R1 is a work register of the ALU, while address i is stored in a register of branch unit BRU. Operations LD and ST are to be performed by one of the MEMU units, operation OP is to be performed by unit ALU, and operation BR is to be performed by unit BRU. Operations LD and OP will be provided in parallel to units MEMU and ALU in the same compound instruction, while operations ST and BR will be provided in parallel to units MEMU and BRU in a second compound instruction.
In fact, some compound instructions include fields which are provided to several units at a time. For example, a load instruction LD, meant for a unit MEMU, also includes a field meant for unit ALU to prepare one of its registers (R1) to receive the data which will be presented by the memory. Similarly, a store instruction ST includes a field meant for unit ALU to select a register, the content of which is presented over the memory bus. Thus, as shown in FIG. 1, a field f of each of instructions I2 and I3 provided to units MEMU is provided to the instruction queue 18 of unit ALU in parallel with a normal instruction I1, and unit ALU is able to perform in one cycle a normal instruction and an operation indicated by a field f.
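Purely as an illustration of this grouping (the patent does not give a concrete instruction format; all names below are hypothetical), a compound instruction can be pictured as four slots, the memory-access slots carrying an extra field f that is also routed to the ALU queue:

    /* Illustrative sketch only: the encoding is not specified by the patent. */
    struct elementary_insn {
        unsigned opcode;     /* LD, ST, OP, BR, ... */
        unsigned operands;   /* address or register numbers */
        unsigned field_f;    /* register selection forwarded to unit ALU:
                                destination register for a LD,
                                source register for a ST */
    };

    struct compound_insn {
        struct elementary_insn slot[4];  /* one slot for BRU, one for ALU,
                                            and one for each MEMU unit */
    };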
The following table illustrates, for several iterations of the loop, the operations performed by one memory access unit MEMU and by arithmetic unit ALU. The branching instructions BR do not raise any problem and the table does not illustrate them, for clarity.
Each row in the table corresponds to an instruction cycle and each operation marked in the table is assigned a number corresponding to the loop iteration. The "Instruction queues" columns give the contents of each unit's queue at the beginning of the cycle, most recently received instruction first, and the "Units" columns give the instruction, if any, executed by each unit during that cycle.
        Instruction queues                                      Units
Cycle   MEMU                               ALU                  MEMU   ALU
    1   LD1                                OP1                  LD1    —
    2   ST1                                OP1                  —      —
    3   LD2, ST1                           OP2, OP1             —      OP1
    4   ST2, LD2, ST1                      OP2                  ST1    —
    5   LD3, ST2, LD2                      OP3, OP2             LD2    —
    6   ST3, LD3, ST2                      OP3, OP2             —      —
    7   LD4, ST3, LD3, ST2                 OP4, OP3, OP2        —      OP2
    8   ST4, LD4, ST3, LD3, ST2            OP4, OP3             ST2    —
    9   LD5, ST4, LD4, ST3, LD3            OP5, OP4, OP3        LD3    —
   10   ST5, LD5, ST4, LD4, ST3            OP5, OP4, OP3        —      —
   11   LD6, ST5, LD5, ST4, LD4, ST3       OP6, OP5, OP4, OP3   —      OP3
   12   ST6, LD6, ST5, LD5, ST4, LD4, ST3  OP6, OP5, OP4        ST3    —
At the first cycle, units MEMU and ALU receive the first instructions LD and OP (LD1, OP1). Unit MEMU immediately executes instruction LD1, which starts the read cycle of the value stored at address i in the memory. Instruction LD1 is deleted from the instruction queue of unit MEMU. Instruction OP1, which needs the value fetched by instruction LD1, cannot be executed yet. This instruction OP1 waits in the instruction queue of unit ALU.
At the second cycle, unit MEMU receives the first instruction ST (ST1). Instruction ST1, which needs the result of operation OP1, cannot be executed yet and waits in the instruction queue of unit MEMU. Instruction OP1 still waits in the queue of unit ALU, since the memory has not yet sent back the operand it requires.
At the third cycle, units MEMU and ALU receive instructions LD2 and OP2. These instructions are put in the queues after the still unexecuted instructions ST1 and OP1. The memory finally sends back the operand required by instruction OP1. This instruction OP1 is then executed and deleted from the instruction queue.
At cycle 4, unit MEMU receives instruction ST2. Instruction ST2 is put to wait in the queue of unit MEMU after instruction LD2. Since instruction OP1 was executed at the previous cycle, its result is available. Instruction ST1 can thus be executed and deleted from the queue. Although instruction OP2 is alone in the queue of unit ALU, this instruction cannot be executed yet since it requires an operand which will be fetched by the execution of instruction LD2.
At cycle 5, units MEMU and ALU receive instructions LD3 and OP3. Instruction LD2 is executed and deleted from the queue.
Instruction OP2 must still wait in the queue, since it requires an operand which will be sent back two cycles later by the memory in response to instruction LD2.
From the fifth cycle on, the execution of the instructions of the second iteration of the loop proceeds as for the first iteration starting at cycle 1.
As shown by the table, although the processor is capable of performing one memory access at each cycle, it only performs a memory access during two cycles out of four, that is, the loop execution efficiency is only 50%.
Further, upon each new iteration of the loop, the instruction queue of unit MEMU fills with additional instructions and ends up overflowing. To avoid the overflow, the provision of instructions must be regularly stopped to enable the queues to empty. This considerably decreases the efficiency.
In fact, the loop programming in its straightforward form is not at all optimal due to the memory latency.
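These two observations can be checked with a small queue-level simulation, given below as a sketch of our own (not part of the patent). It issues the compound instructions of loop (1) on one MEMU unit, applies the two-cycle read latency and the in-order queue behaviour described above, and reproduces the schedule of the table: six memory accesses in twelve cycles, with six instructions still pending in the MEMU queue at the end.

    /* Queue-level sketch of FIG. 1 running loop (1) on one MEMU unit.
     * Assumptions (ours, matching the description above): two-cycle SRAM read
     * latency, in-order execution from the head of each queue, one compound
     * instruction issued per cycle (LDk/OPk on odd cycles, STk on even cycles). */
    #include <stdio.h>

    #define CYCLES 12
    #define QMAX   64

    enum { LD, ST };
    typedef struct { int op[QMAX]; int iter[QMAX]; int head, tail; } queue_t;

    static void push(queue_t *q, int op, int iter)
    {
        q->op[q->tail] = op;
        q->iter[q->tail] = iter;
        q->tail++;
    }

    int main(void)
    {
        queue_t memu = {{0}}, alu = {{0}};  /* the ALU queue only ever holds OPk */
        int ld_done[QMAX] = {0};            /* cycle at which LDk was executed */
        int op_done[QMAX] = {0};            /* cycle at which OPk was executed */
        int accesses = 0;

        for (int cycle = 1; cycle <= CYCLES; cycle++) {
            int k = (cycle + 1) / 2;        /* loop iteration being issued */
            if (cycle % 2) {                /* odd cycle: compound LDk / OPk */
                push(&memu, LD, k);
                push(&alu, 0, k);
            } else {                        /* even cycle: compound STk / BRk */
                push(&memu, ST, k);
            }

            /* MEMU executes the head of its queue when it is ready: a load can
             * always start; a store needs OPk's result, which is available the
             * cycle after OPk was executed. */
            if (memu.head != memu.tail) {
                int op = memu.op[memu.head], it = memu.iter[memu.head];
                if (op == LD || (op_done[it] && op_done[it] < cycle)) {
                    if (op == LD)
                        ld_done[it] = cycle;   /* data returns at cycle + 2 */
                    printf("cycle %2d: MEMU executes %s%d\n",
                           cycle, op == LD ? "LD" : "ST", it);
                    memu.head++;
                    accesses++;
                }
            }

            /* ALU executes OPk once the operand fetched by LDk has returned. */
            if (alu.head != alu.tail) {
                int it = alu.iter[alu.head];
                if (ld_done[it] && cycle >= ld_done[it] + 2) {
                    op_done[it] = cycle;
                    printf("cycle %2d: ALU  executes OP%d\n", cycle, it);
                    alu.head++;
                }
            }
        }

        printf("%d memory accesses in %d cycles (%.0f%% of the bus capacity)\n",
               accesses, CYCLES, 100.0 * accesses / CYCLES);
        printf("still queued: %d in MEMU, %d in ALU\n",
               memu.tail - memu.head, alu.tail - alu.head);
        return 0;
    }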
To improve the efficiency while taking the memory latency into account, a so-called loop unrolling technique is often used. This technique consists of programming a macroloop, each iteration of which corresponds to several iterations of the normal loop. The preceding loop (1) is then rewritten accordingly.
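As a generic illustration of the technique (the patent's own unrolled listing is not reproduced here), the C sketch of loop (1) given earlier can be unrolled by a factor of two, so that the load of one element is issued while the previous element is still being computed or stored:

    /* Generic loop-unrolling sketch (illustrative only): one macroloop
     * iteration processes two elements, so the load of one element can be
     * issued while the previous element is still being computed or stored. */
    void increment_loop_unrolled(int *mem, int N, int K)
    {
        int i;
        for (i = 0; i + 1 < N; i += 2) {   /* macroloop: 2 normal iterations */
            int a = mem[i];                /* LD of iteration i               */
            int b = mem[i + 1];            /* LD of iteration i+1, issued     */
                                           /* before a's result is consumed   */
            a = a + K;                     /* OP of iteration i               */
            b = b + K;                     /* OP of iteration i+1             */
            mem[i]     = a;                /* ST of iteration i               */
            mem[i + 1] = b;                /* ST of iteration i+1             */
        }
        for (; i < N; i++)                 /* remaining element, if N is odd  */
            mem[i] += K;
    }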
Anderson Matthew D.
Engelson Gary S.
Kim Matthew
Morris James H.
STMicroelectronics S.A.