Electrical computers and digital processing systems: processing – Instruction decoding – Predecoding of instruction component
Reexamination Certificate
2001-07-19
2004-12-14
Kim, Kenneth S. (Department: 2111)
Electrical computers and digital processing systems: processing
Instruction decoding
Predecoding of instruction component
C712S210000, C712S226000
Reexamination Certificate
active
06832307
ABSTRACT:
TECHNICAL FIELD OF THE INVENTION
The present invention is directed, in general, to maximizing instruction throughput in a pipelined processor and, more specifically, to folding instructions.
BACKGROUND OF THE INVENTION
Pipelined processors are capable of concurrently executing several different assembly or machine language instructions by breaking the processing steps for each instruction into several discrete processing phases, each of which is executed by a separate pipeline stage. Each instruction must pass through each processing phase—and therefore each pipeline stage—sequentially to complete execution. Within an n stage pipeline (where “n” is any positive nonzero integer), each instruction requires n processing phases to complete execution, although typically at least one instruction may be completed every clock cycle.
Generally a given instruction requires processing by only one pipeline stage at a time (i.e., within any given clock cycle). Since instructions all use the pipeline stages in the same order, an n stage pipeline is capable of working on n instructions concurrently. The execution rate is thus theoretically n times faster than an equivalent non-pipelined processor in which every phase of execution for one instruction must be completed prior to initiation of processing of another instruction, although pipeline overheads and other factors typically make the actual performance improvement factor somewhat less than n.
As noted, a full pipeline can theoretically complete an instruction every clock cycle. One technique often employed to further increase instruction execution efficiency is folding, a process generally performed by the decode stage and involving combination of two or more program instructions into a single instruction which can be executed more quickly. In a typical case, m instructions (where “m” is any positive nonzero integer), each of which would individually require 1 pipeline cycle to execute, are combined into a single instruction taking only one pipeline cycle total to execute, saving m−1 pipeline cycles.
The folding technique relies upon: (1) the ability of the instruction decoder to extract two or more instructions per clock cycle from the instruction fetch buffer from which the instruction decoder receives instructions, combine instructions (suitably), and forward the resulting single “pseudo” instruction to the operand fetch and execution stages; (2) the ability of the instruction fetch stage to supply (on average) more than one instruction per clock cycle to the instruction fetch buffer so that the instruction fetch buffer normally contains more than one instruction during any given clock cycle, giving the decoder an opportunity to fold instructions; and (3) the ability of the operand fetch and execution stages together to handle operations more complex than those expressed by any individual instruction within the processor's normal instruction set, making possible the combination of instructions into more complex single-cycle operations.
As an example of instruction folding, consider a load and add instruction:
ld mem1, R1
(load contents of memory location
mem1 into register R1);
add R2, R1
(add contents of registers R1 and
R2 and place the result in
register R1).
These two instructions may be folded into a single load/add pseudo-operation:
ld/add mem1, R2, R1
(add contents of registers R1 and
R2 and place the result in
register R1),
which potentially takes only half the execution time.
Instruction folding schemes are limited, however, by the complexity of the instruction decoder, which typically must determine whether two or more instructions may be folded within a single clock cycle. To illustrate the problem, consider an instruction set architecture (ISA) of 100 instructions, out of which 10 different instructions may be folded as combined pairs for execution within a particular processor design. In this case, the instruction decoder must examine the first two instructions within the instruction fetch buffer for 100 possible folding combinations out of 10,000 possible combinations of two instructions. For decoders which support folding across more than only two instructions, the number of possible instruction combinations increases exponentially. In any case, such checks will significantly limit the decoder speed.
In practice, therefore, the instruction decode stage must strictly limit the scope of its search for folding combinations among the instructions contained within the instruction fetch buffer in order to complete the decode operation (which includes producing control information for subsequent pipeline stages) in a short period of time, usually one clock cycle. However, these constraints may produce unsatisfactory results, missing many folding opportunities. For instance, a series of instructions including a load, a subtract, and a store:
ld mem1, R1
(load contents of memory location
mem1 into register R1);
sub R2, R1
(subtract contents of register R2
from R1 and place the result in
register R1); and
st R1, mem2
(store contents of register R1 in
memory location mem2)
might be folded into a single-cycle pseudo-instruction:
ld/sub/st R2, mem1, R1/mem2
(subtract contents of R2 from
mem1 and place result in R1
and mem2).
If the instruction decode stage is limited to examining only two instructions within the instruction fetch buffer at a time, only the first two instructions would be folded and the resulting sequence:
Ld/sub R2, mem1, R1
(subtract contents of R2 from mem1
and place result in R1); and
st R1, mem2
(store contents of R1 in mem2)
would require two clock cycles to execute.
There is, therefore, a need in the art for improving instruction folding to allow examination of a greater number of instruction combination permutations for potential folding without impairing instruction decode speed.
SUMMARY OF THE INVENTION
To address the above-discussed deficiencies of the prior art, it is a primary object of the present invention to provide, for use in a processor, a plurality of fold decoders each coupled to a different set of successive entries within an instruction fetch buffer stack and check the contents of the successive entries for a variable number of variable-length instructions which may be folded. Folding information for each of the respective set of entries, identifying a number of instructions therein which may be folded (if any) and a size of each instruction which may be folded, is produced by the fold decoders and stored in the first entry of the set, then transmitted to the main decoder for use in folding instructions during decoding.
The foregoing has outlined rather broadly the features and technical advantages of the present invention so that those skilled in the art may better understand the detailed description of the invention that follows. Additional features and advantages of the invention will be described hereinafter that form the subject of the claims of the invention. Those skilled in the art will appreciate that they may readily use the conception and the specific embodiment disclosed as a basis for modifying or designing other structures for carrying out the same purposes of the present invention. Those skilled in the art will also realize that such equivalent constructions do not depart from the spirit and scope of the invention in its broadest form.
Before undertaking the DETAILED DESCRIPTION OF THE INVENTION below, it may be advantageous to set forth definitions of certain words or phrases used throughout this patent document: the terms “include” and “comprise,” as well as derivatives thereof, mean inclusion without limitation; the term “or” is inclusive, meaning and/or; the phrases “associated with” and “associated therewith,” as well as derivatives thereof, may mean to include, be included within, interconnect with, contain, be contained within, connect to or with, couple to or with, be communicable with, cooperate with, interleave, juxtapose, be proximate to, be bound to or with, have, have a property of, or the like; and the term “controller” means any device, system or part t
Jorgenson Lisa K.
Kim Kenneth S.
Munck William A.
STMicroelectronics Inc.
LandOfFree
Instruction fetch buffer stack fold decoder for generating... does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Instruction fetch buffer stack fold decoder for generating..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Instruction fetch buffer stack fold decoder for generating... will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-3336586