Loop cache memory and cache controller for pipelined...

Electrical computers and digital processing systems: memory – Storage accessing and control – Hierarchical memories

Reexamination Certificate

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

C712S215000, C712S241000

Reexamination Certificate

active

06567895

ABSTRACT:

FIELD OF THE INVENTION
The present invention pertains generally to pipelined microprocessors, and pertains more particularly to methods and microprocessor structures for executing software pipelined loops.
BACKGROUND OF THE INVENTION
A microprocessor is a circuit that combines the instruction-handling, arithmetic, and logical operations of a computer on a single chip. A digital signal processor (DSP) is a microprocessor optimized to handle large volumes of data efficiently. Such processors are central to the operation of many of today's electronic products, such as high-speed modems, high-density disk drives, digital cellular phones, and complex automotive systems, and will enable a wide variety of other digital systems in the future. The demands placed upon DSPs in these environments continue to grow as consumers seek increased performance from their digital products.
Designers have succeeded in increasing the performance of DSPs generally by increasing clock frequencies, by removing architectural bottlenecks in DSP circuit design, by incorporating multiple execution units on a single processor circuit, and by developing optimizing compilers that schedule operations to be executed by the processor in an efficient manner. As further increases in clock frequency become more difficult to achieve, designers have embraced the multiple execution unit processor as a means of achieving enhanced DSP performance. For example,
FIG. 1
shows a block diagram of a DSP execution unit and register structure. having eight execution units, L
1
, S
1
, M
1
, D
1
, L
2
, S
2
, M
2
, and D
2
. These execution units operate in parallel to perform multiple operations, such as addition, multiplication, addressing, logic functions, and data storage and retrieval, simultaneously.
Theoretically, the performance of a multiple execution unit processor is proportional to the number of execution units available. However, utilization of this performance advantage depends on the efficient scheduling of operations such that most of the execution units have a task to perform each clock cycle. Efficient scheduling is particularly important for looped instructions, since in a typical runtime application the processor will spend the majority of its time in loop execution.
Traditionally, the compiler is the piece of software that performs the scheduling operations. The compiler is the piece of software that translates source code, such as C, BASIC, or FORTRAN, into a binary image that actually runs on a machine. Typically the compiler consists of multiple distinct phases. One phase is referred to as the front end, and is responsible for checking the syntactic correctness of the source code. If the compiler is a C compiler, it is necessary to make sure that the code is legal C code. There is also a code generation phase, and the interface between the front-end and the code generator is a high level intermediate representation. The high level intermediate representation is a more refined series of instructions that need to be carried out. For instance, a loop might be coded at the source level as: for(I=0,1<10,1=1+1), which might in fact be broken down into a series of steps, e.g. each time through the loop, first load up I and check it against 10 to decide whether to execute the next iteration.
A code generator of the code generator phase takes this high level intermediate representation and transforms it into a low level intermediate representation. This is closer to the actual instructions that the computer understands. An optimizer component of a compiler must preserve the program semantics (i.e. the meaning of the instructions that are translated from source code to an high level intermediate representation, and thence to a low level intermediate representation and ultimately an executable file), but rewrites or transforms the code in a way that allows the computer to execute an equivalent set of instructions in less time.
Source programs translated into machine code by compilers consists of loops, e.g. DO loops, FOR loops, and WHILE loops. Optimizing the compilation of such loops can have a major effect on the run time performance of the program generated by the compiler. In some cases, a significant amount of time is spent doing such bookkeeping functions as loop iteration and branching, as opposed to the computations that are performed within the loop itself. These loops often implement scientific applications that manipulate large arrays and data instructions, and run on high speed processors. This is particularly true on modern processors, such as RISC architecture machines. The design of these processors is such that in general the arithmetic operations operate a lot faster than memory fetch operations. This mismatch between processor and memory speed is a very significant factor in limiting the performance of microprocessors. Also, branch instructions, both conditional and unconditional, have an increasing effect on the performance of programs. This is because most modern architectures are super-pipelined and have some sort of a branch prediction algorithm implemented. The aggressive pipelining makes the branch misprediction penalty very high. Arithmetic instructions are interregister instructions that can execute quickly, while the branch instructions, because of mispredictions, and memory instructions such as loads and stores, because of slower memory speeds, can take a longer time to execute.
One effective way in which looped instructions can be arranged to take advantage of multiple execution units is with a software pipelined loop. In a conventional scalar loop, all instructions execute for a single iteration before any instructions execute for following iterations. In a software pipelined loop, the order of operations is rescheduled such that one or more iterations of the original loop begin execution before the preceding iteration has finished. Referring to
FIG. 7
, a simple scalar loop containing 20 iterations of the loop of instructions A, B, C, D and E is shown.
FIG. 8
depicts an alternative execution schedule for the loop of
FIG. 7
, where a new iteration of the original loop is begun each clock cycle. For clock cycles I
4
-I
19
in the same instruction (A
n
, B
n−1
, C
n−2
, D
n−3
, E
n−4
) is executed each clock cycle in this schedule; If multiple execution units are available to execute these operations in parallel, the code can be restructured to perform this repeated instruction in a loop. The repeating pattern of A,B,C,D,E (along with loop control operations) thus forms the loop kernel of a new, software pipelined loop that executes the instructions at clock cycles I
4
-I
19
in 16 loops. The instructions executed at clock cycles I, through 13 of
FIG. 8
must still be executed first in order to properly “fill” the software pipelined loop; these instructions are referred to as the loop prolog. Likewise, the instructions executed at clock cycles 120 and 123 of
FIG. 8
must still be executed in order to properly “drain” the software pipeline; these instructions are referred to as the loop epilog (note that in many situations the loop epilog may be deleted through a technique known as speculative execution).
The simple example of
FIGS. 7 and 8
illustrates the basic principles of software pipelining, but other considerations such as dependencies and conflicts may constrain a particular scheduling solution. For an explanation of software pipelining in more detail, see Vicki H. Allan,
Software Pipelining
, 27 ACM Computing Surveys 367 (1995).
One disadvantage of software pipelining is the need for a specialized loop prolog for each loop. The loop prolog explicitly sequences the initiation of the first several iterations of a pipeline, until the steady-state loop kernel can be entered (this is commonly called “filling” the pipeline). Steady-state operation is achieved only after every instruction in the loop kernel will have valid operands if the kernel is executed. As a rule of thumb, the loop kernel can be executed in steady state a

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

Loop cache memory and cache controller for pipelined... does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with Loop cache memory and cache controller for pipelined..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Loop cache memory and cache controller for pipelined... will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-3035555

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.