Method and apparatus for subroutine call-return prediction

Type: Reexamination Certificate (active)
Patent number: 06289444
Filed: 1999-06-02
Issued: 2001-09-11
Examiner: Pan, Daniel (Department: 2183)
Classification: Electrical computers and digital processing systems: processing – Processing control – Branching
Related classes: C712S217000, C712S216000, C712S239000, C712S240000, C712S234000, C712S231000
ABSTRACT:
BACKGROUND
1. Technical Field
The present invention relates generally to computer processing systems and, in particular, to a method and apparatus for predicting the target of a subroutine return branch in a computer processing system. The present invention may be employed in the case of conventional subroutines, nested subroutines, foliated subroutines, and in the case of subroutine invocations through stubs (such as, for example, in the cases of virtual method invocation or dynamic library procedure invocation).
2. Background Description
Early microprocessors generally processed instructions one at a time. Each instruction was processed using four sequential stages: instruction fetch; instruction decode; instruction execute; and result writeback. Within such microprocessors, different dedicated logic blocks performed each different processing stage. Each logic block waited until all the preceding logic blocks completed operations before beginning its operation.
Improved computational speed has been obtained by increasing the speed with which the computer hardware operates and by introducing parallel processing in one form or another. One form of parallel processing relates to the recent introduction of microprocessors of the “superscalar” type, which can effect parallel instruction computation. Typically, superscalar microprocessors have multiple execution units (e.g., multiple integer arithmetic logic units (ALUs)) for executing instructions and, thus, have multiple “pipelines”. As such, multiple machine instructions may be executed simultaneously in a superscalar microprocessor, providing obvious benefits in the overall performance of the device and its system application.
For the purposes of this discussion, latency is defined as the delay between the fetch stage of an instruction and the execution stage of the instruction. Consider an instruction which references data stored in a specified register. Such an instruction requires at least four machine cycles to complete. In the first cycle, the instruction is fetched from memory. In the second cycle, the instruction is decoded. In the third cycle, the instruction is executed and, in the fourth cycle, data is written back to the appropriate location.
To improve efficiency and reduce instruction latency, microprocessor designers overlapped the operations of the fetch, decode, execute, and writeback logic stages such that the microprocessor operated on several instructions simultaneously. In operation, the fetch, decode, execute, and writeback logic stages concurrently process different instructions. At each clock pulse the result of each processing stage is passed to the subsequent processing stage. Microprocessors that use the technique of overlapping the fetch, decode, execute, and writeback stages are known as “pipelined” microprocessors. In principle, a pipelined microprocessor can complete the execution of one instruction per machine cycle when a known sequence of instructions is being executed. Thus, it is evident that the effects of the latency time are reduced in pipelined microprocessors by initiating the processing of a second instruction before the actual execution of the first instruction is completed.
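As a concrete illustration of this overlap (not taken from the patent; the stage names and instruction count below are illustrative), the following C sketch steps a four-stage pipeline one clock pulse at a time and prints which instruction occupies each stage. Once the pipeline is full, one instruction completes per cycle, exactly as described above.

#include <stdio.h>

#define NUM_STAGES 4   /* fetch, decode, execute, writeback */
#define NUM_INSNS  6   /* illustrative instruction count    */

static const char *stage_name[NUM_STAGES] = {
    "fetch", "decode", "execute", "writeback"
};

int main(void)
{
    /* stage[s] holds the index of the instruction currently in stage s,
     * or -1 if that stage is empty.                                     */
    int stage[NUM_STAGES] = { -1, -1, -1, -1 };
    int next_insn = 0;
    int total_cycles = NUM_INSNS + NUM_STAGES - 1;   /* fill + drain */

    for (int cycle = 1; cycle <= total_cycles; cycle++) {
        /* At each clock pulse, every stage hands its instruction to the
         * next stage, and a new instruction (if any remain) is fetched. */
        for (int s = NUM_STAGES - 1; s > 0; s--)
            stage[s] = stage[s - 1];
        stage[0] = (next_insn < NUM_INSNS) ? next_insn++ : -1;

        printf("cycle %2d:", cycle);
        for (int s = 0; s < NUM_STAGES; s++) {
            if (stage[s] >= 0)
                printf("  %s=I%d", stage_name[s], stage[s]);
            else
                printf("  %s=--", stage_name[s]);
        }
        printf("\n");
    }
    return 0;
}

From cycle 4 onward every stage is occupied, so one instruction leaves the writeback stage per cycle even though each individual instruction still spends four cycles in flight.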
In general, instruction flow in a microprocessor requires that instructions be fetched and decoded from sequential locations in memory. Unfortunately, computer programs also include branch instructions. A branch instruction is an instruction that causes a disruption in this flow; e.g., a taken branch causes decoding to be discontinued along the sequential path and resumed at a new location in memory. The new location in memory may be referred to as the target address of the branch. Such an interruption in pipelined instruction flow results in a substantial degradation in pipeline performance.
There are various types of branch instructions. One type of branch instruction is known as an unconditional branch in that it unconditionally transfers control from the branch instruction to the target instruction. That is, at the time that the branch instruction is decoded, it is known that the transfer of control to the target instruction will take place. Examples of unconditional branches include subroutine CALL/RETURN and GOTO. In terms of performance, a more costly branch instruction is known as a conditional branch. This instruction specifies that control is to be transferred to the target instruction only if some condition, as determined by the outcome of a previous instruction, is met. Examples of conditional branch constructs include the DO LOOP and the IF/THEN/ELSE.
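For concreteness, the short C fragment below (illustrative only; the function and variable names are hypothetical) marks where a compiler would typically emit each kind of branch mentioned above: an unconditional call and return, a conditional loop-closing branch, a conditional IF/THEN/ELSE branch, and an unconditional GOTO.

#include <stdio.h>

/* Reached through an unconditional call; the matching return is also
 * unconditional, but its target depends on who made the call.         */
static int square(int x)
{
    return x * x;                    /* unconditional return branch          */
}

int main(void)
{
    int sum = 0;

    for (int i = 0; i < 4; i++)      /* loop: conditional backward branch    */
        sum += square(i);            /* unconditional call, then return      */

    if (sum > 10)                    /* IF/THEN/ELSE: conditional branch     */
        printf("large sum: %d\n", sum);
    else
        printf("small sum: %d\n", sum);

    goto done;                       /* GOTO: unconditional branch           */
done:
    return 0;
}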
Subroutine linkage typically involves a call to a subroutine and a return from the subroutine back to the instruction immediately following the call. Usually, the call is done through a branch instruction which saves the address to return to in a register, while the return is done by branching indirectly through the contents of this register. For example, in the PowerPC, the branch-and-link instruction (BL) is used for the call. This instruction saves the address of the immediately following instruction in a special register referred to as the link register. The branch-using-link-register instruction (BCLR) is used to return from the subroutine through the contents of the link register. In System/390, the corresponding instructions are BAL or BALR for the call, and BR for the return. In this case, the link information is kept in a general purpose register that is specified with the instruction, instead of in a dedicated link register.
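The C sketch below models this linkage convention in software as an illustration (it is not the patent's mechanism; the toy instruction set is hypothetical): the simulated CALL saves the address of the next instruction in a link register, and the simulated RETURN branches indirectly through that register, much as BL/BCLR do on the PowerPC or BAL/BALR and BR do on System/390.

#include <stdio.h>

/* A toy machine: each "instruction" is just an index into prog[]. */
enum opcode { OP_CALL, OP_RETURN, OP_PRINT, OP_HALT };

struct insn {
    enum opcode op;
    int         target;   /* call target (index into prog[]) */
    const char *text;     /* payload for OP_PRINT            */
};

int main(void)
{
    /* prog[0..2] is the caller, prog[3..4] is the subroutine. */
    struct insn prog[] = {
        { OP_CALL,   3, NULL        },  /* 0: like BL/BALR: save return addr */
        { OP_PRINT,  0, "back home" },  /* 1: executed after the return      */
        { OP_HALT,   0, NULL        },  /* 2:                                */
        { OP_PRINT,  0, "in sub"    },  /* 3: subroutine body                */
        { OP_RETURN, 0, NULL        },  /* 4: like BCLR/BR: branch via link  */
    };

    int pc = 0;      /* program counter         */
    int link = -1;   /* simulated link register */

    for (;;) {
        struct insn *i = &prog[pc];
        switch (i->op) {
        case OP_CALL:
            link = pc + 1;        /* save address of the next instruction */
            pc = i->target;       /* branch to the subroutine             */
            break;
        case OP_RETURN:
            pc = link;            /* branch indirectly through the link   */
            break;
        case OP_PRINT:
            printf("%s\n", i->text);
            pc++;
            break;
        case OP_HALT:
            return 0;
        }
    }
}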
Subroutines pose a problem for heavily pipelined computers (those with many stages in the pipeline). Although the instruction which calls a subroutine will contain enough information to determine which is the next instruction to enter the pipeline (i.e., the first instruction in the called subroutine), the return instruction in the subroutine will not contain such information. Instead, a return instruction must pass through all of the stages of the pipeline before the return address will be known from the return instruction. If the computer waited for the return instruction to pass through the pipeline before entering another instruction, there would then be a “bubble” in the pipeline behind the return instruction in which there would be no instructions, thereby lowering the performance of the computer.
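A common way to avoid this bubble, and a widely used precursor to the kind of prediction discussed in this patent (the sketch below is a generic illustration, not the claimed mechanism), is a return-address stack: on each call the fetch unit pushes the address of the instruction following the call, and on each return it pops that address and uses it as the predicted target. A minimal C model, with hypothetical names and addresses:

#include <stdio.h>

#define RAS_DEPTH 8   /* illustrative depth */

/* A minimal return-address stack (RAS) model.  Overflow simply wraps,
 * overwriting the oldest entry, so a deeply nested return may later be
 * mispredicted.                                                         */
static unsigned ras[RAS_DEPTH];
static unsigned ras_top = 0;

static void ras_push(unsigned return_addr)      /* on a call   */
{
    ras[ras_top % RAS_DEPTH] = return_addr;
    ras_top++;
}

static unsigned ras_pop(void)                   /* on a return */
{
    if (ras_top == 0)
        return 0;                 /* empty: no prediction available */
    ras_top--;
    return ras[ras_top % RAS_DEPTH];
}

int main(void)
{
    /* Nested calls: main -> A (call at 0x100) -> B (call at 0x204). */
    ras_push(0x104);              /* return address for the call to A */
    ras_push(0x208);              /* return address for the call to B */

    printf("predicted return from B: 0x%x\n", ras_pop());  /* 0x208 */
    printf("predicted return from A: 0x%x\n", ras_pop());  /* 0x104 */
    return 0;
}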
To help alleviate the penalty due to the latency of pipelines, many pipelined microprocessors use branch prediction mechanisms that predict the existence and the outcome (i.e., taken or not taken) of branch instructions within an instruction stream. The instruction fetch unit uses the branch predictions to fetch subsequent instructions.
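One widely used form of such a prediction mechanism (a standard technique, not specific to this patent) is a table of two-bit saturating counters indexed by the branch address; each counter drifts toward "taken" or "not taken" as outcomes are observed. The C sketch below, with an illustrative table size and branch address, shows the idea on a loop-closing branch that is taken nine times and then falls through.

#include <stdio.h>
#include <stdbool.h>

#define TABLE_SIZE 1024   /* illustrative table size */

/* Two-bit saturating counters: 0,1 predict not-taken; 2,3 predict taken. */
static unsigned char counters[TABLE_SIZE];   /* zero-initialized: strongly not-taken */

static bool predict(unsigned branch_addr)
{
    return counters[branch_addr % TABLE_SIZE] >= 2;
}

static void update(unsigned branch_addr, bool taken)
{
    unsigned char *c = &counters[branch_addr % TABLE_SIZE];
    if (taken && *c < 3)
        (*c)++;
    else if (!taken && *c > 0)
        (*c)--;
}

int main(void)
{
    unsigned loop_branch = 0x400;   /* hypothetical branch address */
    int correct = 0;

    /* A loop-closing branch taken 9 times, then not taken once. */
    for (int i = 0; i < 10; i++) {
        bool actual = (i < 9);
        if (predict(loop_branch) == actual)
            correct++;
        update(loop_branch, actual);
    }
    printf("correct predictions: %d of 10\n", correct);  /* 7 of 10 here */
    return 0;
}

Note that this table predicts only the taken/not-taken outcome; predicting the target of an indirect return branch requires a different structure, such as the return-address stack sketched earlier.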
When a branch prediction mechanism predicts the outcome of a branch instruction and the microprocessor executes subsequent instructions along the predicted path, the microprocessor is said to have “speculatively executed” along the predicted instruction path. During speculative execution the microprocessor is performing useful processing if the branch instruction was predicted correctly.
However, if the branch prediction mechanism mispredicted the branch instruction, the microprocessor is executing instructions down the wrong path and therefore accomplishes nothing. When the microprocessor eventually detects the mispredicted branch, the microprocessor must flush the instructions that were speculatively fetched from the instruction pipeline and restart execution at the correct address. The effect of the above described non-sequential operation, and of the resultant flushing of the pipeline, is exacerbated in the case of superscalar pipelined microprocessors. For example, if a branch or other interruption in the sequential instruction flow of the microprocessor occurs, the number of lost pipeline slots, or lost execution opportunities, is multiplied by the number of parallel execution units (i.e., parallel pipelines). The performance degradation due to branches and corresponding non-sequential program execution is therefore amplified in superscalar pipelined microprocessors.
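As a rough worked example of this amplification (the numbers are illustrative and do not come from the patent): if a mispredicted branch is not resolved until stage 8 of the pipeline, a scalar machine discards on the order of 8 partially processed instructions, while a 4-wide superscalar machine discards up to 8 × 4 = 32 issue slots. The small C sketch below simply computes that product for a few hypothetical configurations.

#include <stdio.h>

int main(void)
{
    /* Illustrative machine configurations: the pipeline stage at which a
     * mispredicted branch is resolved, and the number of parallel pipelines. */
    struct { int resolve_stage; int issue_width; } cfg[] = {
        { 4, 1 },   /* simple scalar pipeline   */
        { 8, 1 },   /* deeper scalar pipeline   */
        { 8, 4 },   /* deep 4-wide superscalar  */
    };

    for (int i = 0; i < 3; i++) {
        int lost_slots = cfg[i].resolve_stage * cfg[i].issue_width;
        printf("resolve at stage %d, width %d -> up to %2d lost issue slots\n",
               cfg[i].resolve_stage, cfg[i].issue_width, lost_slots);
    }
    return 0;
}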
Prediction of subroutine retu
Attorney/Agent: F. Chau & Associates, LLP
Assignee: International Business Machines Corporation