Self-priming loop execution for loop prolog instruction

Electrical computers and digital processing systems: processing – Processing control – Branching

Reexamination Certificate

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

C712S241000, C717S152000, C708S300000

Reexamination Certificate

active

06289443

ABSTRACT:

FIELD OF THE INVENTION
The present invention pertains generally to the operation of pipelined microprocessors, and pertains more particularly to methods usable on such microprocessors for executing software pipelined loops.
BACKGROUND OF THE INVENTION
A microprocessor is a circuit that combines the instruction-handling, arithmetic, and logical operations of a computer on a single chip. A digital signal processor (DSP) is a microprocessor optimized to handle large volumes of data efficiently. Such processors are central to the operation of many of today's electronic products, such as high-speed modems, high-density disk drives, digital cellular phones, and complex automotive systems, and will enable a wide variety of other digital systems in the future. The demands placed upon DSPs in these environments continue to grow as consumers seek increased performance from their digital products.
Designers have succeeded in increasing the performance of DSPs generally by increasing clock frequencies, by removing architectural bottlenecks in DSP circuit design, by incorporating multiple execution units on a single processor circuit, and by developing optimizing compilers that schedule operations to be executed by the processor in an efficient manner. As further increases in clock frequency become more difficult to achieve, designers have embraced the multiple execution unit processor as a means of achieving enhanced DSP performance. For example,
FIG. 1
shows a block diagram of a DSP execution unit and register structure having eight execution units, L
1
, S
1
, M
1
, D
1
, L
2
, S
2
, M
2
, and D
2
. These execution units operate in parallel to perform multiple operations, such as addition, multiplication, addressing, logic functions, and data storage and retrieval, simultaneously.
Theoretically, the performance of a multiple execution unit processor is proportional to the number of execution units available. However, utilization of this performance advantage depends on the efficient scheduling of operations so that most of the execution units have a task to perform each clock cycle. Efficient scheduling is particularly important for looped instructions, since the processor will typically spend the majority of its time in loop execution.
One effective way in which looped instructions can be arranged to take advantage of multiple execution units is with a software pipelined loop. In a conventional scalar loop, all instructions execute for a single iteration before any instructions execute for following iterations. In a software pipelined loop, the order of operations is rescheduled such that one or more iterations of the original loop begin execution before the preceding iteration has finished.
Referring to
FIG. 2
a
, a simple loop containing 7 iterations of the operations A, B, and C is shown.
FIG. 2
b
depicts an alternative execution schedule for the loop of
FIG. 2
a
, where a new iteration of the original loop is begun each clock cycle. For clock cycles I
3
-I
7
, the same instruction (A
n
,B
n−1
,C
n−2
) is executed each clock cycle in this schedule; if multiple execution units are available to execute these operations in parallel, the code can be restructured to perform this repeated instruction in a loop. The repeating pattern of A,B,C (along with loop control operations) thus forms the loop kernel of a new, software pipelined loop that executes the instructions at clock cycles I
3
-I
7
in 5 loops.
FIG. 2
c
depicts such a loop. The instructions executed at clock cycles I
1
and I
2
of
FIG. 2
b
must still be executed first in order to properly “fill” the software pipelined loop; these instructions are referred to as the loop prolog. Likewise, the instructions executed at clock cycles I
8
and I
9
of
FIG. 2
b
must still be executed in order to properly “drain” the software pipeline; these instructions are referred to as the loop epilog. Note that in many situations the loop epilog may be deleted through a technique known as speculative execution.
The simple example of
FIGS. 2
a
-
2
c
illustrates the basic principles of software pipelining, but other considerations such as dependencies and conflicts may constrain a particular scheduling solution. For an explanation of software pipelining in more detail, see Vicki H. Allen,
Software Pipelining,
27 ACM Computing Surveys 367 (1995).
SUMMARY OF THE INVENTION
One disadvantage of software pipelining is the need for a specialized loop prolog for each loop. The loop prolog explicitly sequences the initiation of the first several iterations of a pipeline, adding instructions each clock cycle until the steady-state loop kernel can be entered. This is commonly called “filling” the pipeline. Steady-state operation is achieved once every instruction in the loop kernel will have valid operands if the kernel is executed. As a rule of thumb, the loop kernel can be executed in steady state after k=1−m clock cycles, where 1 represents the number of clock cycles required to complete one iteration of the pipelined loop, and m represents the number of clock cycles contained in one iteration of the loop kernel. This formula must generally be modified if the kernel is unrolled.
Given this relationship, it can be appreciated that as the cumulative pipeline delay required by a single iteration of a pipelined loop increases, corresponding increases in loop prolog length are usually observed. In some cases, the loop prolog code required to fill the pipeline may be several times the size of the loop kernel code. Code size can be a determining factor in execution speed. Shorter programs can generally use on-chip program memory to a greater extent than longer programs. Thus long loop prologs can be detrimental to program execution speed.
The present invention seeks to reduce code size by eliminating at least a portion of the loop prolog required by prior art software pipelining. The present invention accomplishes this by “self-priming”; i.e., the processor executes the loop kernel for additional iterations in lieu of executing separate prolog code. Of course, the elimination of instructions necessary to fill the pipeline from the prolog does not remove the necessity of the instructions themselves. These instructions are repeated in the loop body and can be executed there.
One problem with utilizing the loop body instructions to prime the loop is that many loop instructions would have invalid operands if the loop is executed without a prolog. Thus the invention implies that one or more steady-state loop operations may be executed prematurely, i.e., before the pipeline reaches steady state, while the loop is “self-priming”. For purposes of this disclosure, an operation executes prematurely if valid data needed to perform the operation in steady-state operation is not yet available through the pipeline. Because premature operations typically produce nonsensical results, and potentially produce results harmful to the ultimate loop result, the present invention also provides several methods for preventing premature loop operations from altering the desired loop result.
In one aspect, the present invention provides a method for operating a processor in a software pipelined loop that eliminates the need for at least part of a loop prolog. Generally, this method comprises sequencing the processor through a desired number of iterations of a software pipelined loop kernel. However, this method further comprises the sequencing of at least one additional iteration of the software pipelined loop kernel through the processor before the pipeline reaches steady-state operation, while insulating the loop results from deleterious effects of the additional iteration. The additional iteration thus may advantageously be used to replace a portion of the loop prolog, typically resulting in shorter code size.
Several general methods are disclosed for insulating loop results from deleterious effects of additional iterations. A first method is output array over allocation. An output array is padded with additional adjacent memory locations s

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

Self-priming loop execution for loop prolog instruction does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with Self-priming loop execution for loop prolog instruction, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Self-priming loop execution for loop prolog instruction will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-2544912

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.