Electrical computers and digital processing systems: processing – Processing control – Branching
Reexamination Certificate
1999-06-02
2001-10-16
Pan, Daniel H. (Department: 2183)
Electrical computers and digital processing systems: processing
Processing control
Branching
C712S237000, C712S239000, C712S206000, C712S207000, C712S023000, C711S137000, C711S128000, C711S204000, C711S207000, C711S213000, C711S212000
Reexamination Certificate
active
06304962
ABSTRACT:
BACKGROUND
1. Technical Field
The present invention relates generally to computer processing systems and, in particular, to a method and apparatus for prefetching superblocks in a computer processing system.
2. Background Description
Early microprocessors generally processed instructions one at a time. Each instruction was processed using four sequential stages: instruction fetch; instruction decode; instruction execute; and result writeback. Within such microprocessors, different dedicated logic blocks performed each different processing stage. Each logic block waited until all the preceding logic blocks completed operations before beginning its operation.
Improved computational speed has been obtained by increasing the speed with which the computer hardware operates and by introducing parallel processing in one form or another. One form of parallel processing relates to the recent introduction of microprocessors of the “superscalar” type, which can effect parallel instruction computation. Typically, superscalar microprocessors have multiple execution units (e.g., multiple integer arithmetic logic units (ALUs)) for executing instructions and, thus, have multiple “pipelines”. As such, multiple machine instructions may be executed simultaneously in a superscalar microprocessor, providing obvious benefits in the overall performance of the device and its system application.
For the purposes of this discussion, latency is defined as the delay between the fetch stage of an instruction and the execution stage of the instruction. Consider an instruction which references data stored in a specified register. Such an instruction requires at least four machine cycles to complete. In the first cycle, the instruction is fetched from memory. In the second cycle, the instruction is decoded. In the third cycle, the instruction is executed and, in the fourth cycle, data is written back to the appropriate location.
To improve efficiency and reduce instruction latency, microprocessor designers overlapped the operations of the fetch, decode, execute, and writeback logic stages such that the microprocessor operated on several instructions simultaneously. In operation, the fetch, decode, execute, and writeback logic stages concurrently process different instructions. At each clock pulse the result of each processing stage is passed to the subsequent processing stage. Microprocessors that use the technique of overlapping the fetch, decode, execute, and writeback stages are known as “pipelined” microprocessors. In principle, a pipelined microprocessor can complete the execution of one instruction per machine cycle when a known sequence of instructions is being executed. Thus, it is evident that the effects of the latency time are reduced in pipelined microprocessors by initiating the processing of a second instruction before the actual execution of the first instruction is completed.
In general, instruction flow in a microprocessor requires that the instructions are fetched and decoded from sequential locations in a memory. Unfortunately, computer programs also include branch instructions. A branch instruction is an instruction that causes a disruption in this flow, e.g., a taken branch causes decoding to be discontinued along the sequential path, and resumed at a new location in memory. The new location in memory may be referred to as a target address of the branch. Such an interruption in pipelined instruction flow results in a substantial degradation in pipeline performance.
There are various types of branch instructions. One type of branch instruction is known as an unconditional branch in that it unconditionally transfers control from the branch instruction to the target instruction. That is, at the time t hat the branch instruction is decoded, it is known that the transfer of control to the target instruction will take place. Examples of unconditional branches include subroutine CALL/RETURN and GOTO. In terms of performance, a more costly branch instruction is known as a conditional branch. This instruction specifies that control is to be transferred to the target instruction only if some condition, as determined by the outcome of a previous instruction, is met. Examples of conditional branch constructs include the DO LOOP and the IF/THEN/ELSE.
Unconditional branches, such as subroutine CALL/RETURN and GOTO, are always taken. Thus, their behavior seems easy to predict. On the other hand, conditional branching constructs, such as the DO LOOP and the IF/THEN/ELSE block, do not always follow the same path. In actuality, a DO LOOP is a fairly well-behaved structure, usually resulting in a taken branch to a destination at a negative displacement off the program counter. However, the IF/THEN/ELSE block may not exhibit predictable behavior since it may change direction on subsequent iterations.
If it can be determined at instruction decode time that a conditional branch instruction will not be taken, then there is no penalty associated with the execution of the conditional branch instruction. That is, the next sequential instruction may be decoded immediately following the decoding of the branch instruction. However, if it is determined that the branch will be taken, a multi-cycle penalty associated with the branch is incurred in that the target address must be generated and the target instruction must be fetched.
Accordingly, many pipelined microprocessors use branch prediction mechanisms that predict the existence (i.e., taken or not taken) and the outcome of branch instructions within an instruction stream. The instruction fetch unit uses the branch predictions to fetch subsequent instructions.
When a branch prediction mechanism predicts the outcome of a branch instruction and the microprocessor executes subsequent instructions along the predicted path, the microprocessor is said to have “speculatively executed” along the predicted instruction path. During speculative execution the microprocessor is performing useful processing if the branch instruction was predicted correctly.
However, if the branch prediction mechanism mispredicted the branch instruction, the microprocessor is executing instructions down the wrong path and, thus, accomplishes nothing. When the microprocessor eventually detects the mispredicted branch, the microprocessor must flush the instructions that were speculatively fetched from the instruction pipeline and restart execution at the correct address. The effect of the above described non-sequential operation, and of the resultant flushing of the pipeline, is exacerbated in the case of superscalar pipelined microprocessors. For example, if a branch or other interruption in the sequential instruction flow of the microprocessor occurs, the number of lost pipeline slots, or lost execution opportunities, is multiplied by the number of parallel execution units (i.e., parallel pipelines). The performance degradation due to branches and corresponding non-sequential program execution is therefore amplified in superscalar pipelined microprocessors.
Thus, when the number of stages in a processor pipeline increases, the latency between the instruction execute stage and the instruction fetch stage becomes longer. This causes an increased delay in the launching of instructions after a conditional branch which depends on the execution of a prior instruction. Unless useful work can be inserted between the instructions and the conditional branch by the compiler, the delay results in lost cycles.
To reduce the number of lost cycles, especially in the common situation where the compiler is unable to insert more than a couple of instructions during the time in which the branch instruction is being resolved, it is necessary for the hardware to accurately anticipate (predict) the result of executing a particular branch and fetch instructions along the appropriate stream. When the prediction is correct, lost cycles are avoided. However, when the prediction is incorrect, it is possible that the number of lost cycles is greater than if no prediction was attempted. It therefore becomes essential to in
F. Chau & Associates LLP
International Business Machines - Corporation
Pan Daniel H.
LandOfFree
Method and apparatus for prefetching superblocks in a... does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Method and apparatus for prefetching superblocks in a..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Method and apparatus for prefetching superblocks in a... will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-2611075