Electrical computers and digital processing systems: processing – Processing control – Branching
Reexamination Certificate
1998-09-04
2001-07-17
Treat, William M. (Department: 2183)
C712S237000, C712S238000, C712S239000, C712S240000, C712S241000
active
06263427
ABSTRACT:
BACKGROUND OF THE INVENTION
The present invention relates to a computer processor having one or more branch target buffers for improving the performance of branch instruction execution. The invention is useful for both pipelined and non-pipelined architectures. Among pipelined architectures, it is useful for both single-pipeline and superscalar designs that have two or more pipelines for processing instructions.
The original computers were designed to completely process one instruction before beginning the next instruction in the sequence. Major architectural advances that have increased performance include the use of pipelined and superscalar architectures. These architectures introduce higher levels of design complexity and cost in the computer processor; however, this additional cost is more than offset by the increased performance of pipelined and superscalar computers.
Performance can also be increased by the use of caches in computer architectures. Caches are utilized to store and supply often-used information such as data and instructions. A cache can supply the needed information within one clock cycle, avoiding a memory access that could consume several cycles. One example of a cache that increases performance during a branch instruction is termed a branch target buffer (“BTB”).
As mentioned briefly above, the speed of computers is increased by pipelining instructions. A pipelined computer divides instruction processing into a series of steps, or stages, each of which is preferably executable in a single clock cycle. In a non-pipelined computer, each instruction is processed until it is complete and only then does processing begin on the next instruction. In a pipelined computer, several sequential instructions are processed simultaneously in different stages of the pipeline. Processing in the different processing stages may proceed simultaneously in one clock period in separate portions of the computer.
For example, in a computer processor running pipelined instructions, each stage of the operation is handled in one clock period. The stages into which instruction processing is divided include an instruction fetch stage for fetching the instruction from wherever it is stored, an instruction decode stage for decoding the instruction, an address generation stage for generating the operand address(es), an operand fetch stage for fetching the operands, an execution stage for executing the instruction, and a writeback stage for writing the results of the execution to the registers and memory for later use. Each of the stages is designed to occupy one clock period. Thus, during the first clock period, the instruction fetch portion of the computer fetches an instruction from storage and aligns it so that it is ready for decoding. During the second clock period, the instruction fetch portion fetches the next instruction from storage and aligns it, while the instruction decoder decodes the first instruction fetched. During the third clock period, the first instruction fetched moves into the address generation stage, the second instruction fetched moves into the instruction decode stage, and another instruction moves into the instruction fetch stage. Pipelining continues through each of the stages, including the execution stage and the writeback stage, and thus the overall speed of computer processing is significantly increased over a non-pipelined computer.
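The cycle-by-cycle overlap described above can be sketched as a small simulation. This is an illustrative model, not the patent's design: each clock period, every in-flight instruction advances one stage, so a new instruction can enter fetch while earlier ones are decoded, executed, and written back. The stage names follow the text; the function name and trace format are assumptions for the example.

```python
# Sketch of the six-stage pipeline described above. Each cycle, every
# in-flight instruction advances one stage; several instructions are
# therefore processed simultaneously in different stages.
STAGES = ["fetch", "decode", "addr_gen", "operand_fetch", "execute", "writeback"]

def simulate_pipeline(instructions):
    """Return a per-cycle trace: one {stage: instruction} snapshot per clock."""
    trace = []
    total_cycles = len(instructions) + len(STAGES) - 1
    for cycle in range(total_cycles):
        snapshot = {}
        for depth, stage in enumerate(STAGES):
            idx = cycle - depth  # which instruction occupies this stage now
            if 0 <= idx < len(instructions):
                snapshot[stage] = instructions[idx]
        trace.append(snapshot)
    return trace

trace = simulate_pipeline(["i0", "i1", "i2"])
# In cycle 0 only "i0" is in fetch; by cycle 2 three instructions overlap,
# matching the third-clock-period description in the text.
```

Once the pipeline is full, one instruction completes per clock even though each individual instruction still spends six cycles in flight.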
In a superscalar architecture, two or more instructions may be processed simultaneously in one stage. A superscalar computer has two or more processing paths that are capable of simultaneously executing instructions in parallel. In a scalar computer, the same type of instructions would be run serially. It should be apparent that if two or more instructions are run simultaneously, then the computer can process instructions faster.
If a branch instruction, such as a jump, return, or conditional branch, is in the series of instructions, a pipelined computer will suffer a substantial performance penalty on any taken branch unless there is some form of branch prediction. The penalty is caused on a taken branch because the instructions following it in the pipeline must be thrown away, or “flushed.” For example, if the microarchitecture has three stages preceding an execution stage, then the penalty will be at least three clock cycles when a branch is taken and not predicted, assuming the branch is resolved in the execution stage. This penalty is paid when the incorrect instructions are flushed from the pipeline and the correct instruction at the actual target address is inserted into the pipeline.
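The penalty arithmetic above can be made concrete with a minimal sketch. The assumptions match the text's example: the branch resolves in the execute stage, every stage ahead of it must be flushed when the wrong path was fetched, and an unpredicted branch behaves as a fall-through (not-taken) prediction. The function name is an assumption for illustration.

```python
def flush_penalty(stages_before_execute, predicted_taken, actually_taken):
    """Cycles lost on one branch, assuming the branch resolves in the
    execute stage and all earlier stages hold wrong-path instructions
    when the prediction (or implicit fall-through) was wrong."""
    if predicted_taken == actually_taken:
        return 0  # the right-path instructions were already in the pipeline
    return stages_before_execute  # wrong-path instructions must be flushed

# The text's example: three stages precede execution and the taken branch
# was not predicted, so three clock cycles are lost.
```

A deeper front end (more stages before execute) raises the cost of every such flush, which is why prediction accuracy matters more as pipelines lengthen.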
One way to increase the performance of executing a branch instruction is to predict the outcome of the branch instruction and insert the predicted instruction into the pipeline immediately following the branch instruction. If such a branch prediction mechanism is implemented in a microprocessor, then the penalty is incurred only if the branch is mispredicted. It has been found that a large fraction of branches actually do follow the predictions, largely because of the prevalence of repetitive loops. For example, it may be found that 80% of the branch predictions are correct.
Several types of branch prediction mechanisms have been developed. One type of branch prediction mechanism uses a branch target buffer (i.e., “BTB”) that stores a plurality of entries including an index to a branch instruction. In addition to the index, each entry of the BTB table may include an instruction address, an instruction opcode, history information, and possibly other data. In a microprocessor utilizing a branch target buffer, the branch prediction mechanism monitors each instruction as it enters the pipeline. Specifically, each instruction address is monitored, and when the address matches an entry in the branch target buffer, it is determined that the instruction is a branch instruction that has been taken before. After the entry has been located, the history information is tested to determine whether or not the branch will be predicted to be taken. Typically, the history is maintained by a state machine that monitors each branch in the branch target buffer and updates bits depending upon whether or not the branch has been taken in the preceding cycles. If the branch is predicted to be taken, then the predicted instructions are inserted into the pipeline. Typically, the branch target entry has the opcodes of the target instruction associated with it, and these instructions are inserted directly into the pipeline. Also associated with the branch target buffer entry is an address that points to the predicted target instruction of the branch; this address is used to fetch additional instructions.
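The lookup-and-train cycle described above can be sketched with a 2-bit saturating counter, one common form of the per-branch history state machine. The entry fields (target address, cached target opcodes, history bits) follow the text, but the table layout, method names, and counter encoding are illustrative assumptions, not the patent's specific design.

```python
# Minimal BTB sketch: predict taken when the 2-bit history counter is in
# either "taken" state; after the branch resolves, nudge the counter
# toward the observed outcome and refresh the cached target.
class BranchTargetBuffer:
    STRONG_NOT, WEAK_NOT, WEAK_TAKEN, STRONG_TAKEN = 0, 1, 2, 3

    def __init__(self):
        self.entries = {}  # branch address -> {"target", "opcodes", "history"}

    def lookup(self, pc):
        """Monitor a fetched address; a hit on a taken-leaning entry yields
        the predicted target (and its cached opcodes) for insertion."""
        entry = self.entries.get(pc)
        if entry is None:
            return None  # not a branch seen taken before
        return entry if entry["history"] >= self.WEAK_TAKEN else None

    def update(self, pc, taken, target=None, opcodes=None):
        """Train the entry once the branch's actual outcome is known."""
        if pc not in self.entries:
            self.entries[pc] = {
                "target": target, "opcodes": opcodes,
                "history": self.WEAK_TAKEN if taken else self.WEAK_NOT}
            return
        entry = self.entries[pc]
        if taken:
            entry["history"] = min(entry["history"] + 1, self.STRONG_TAKEN)
            entry["target"] = target
        else:
            entry["history"] = max(entry["history"] - 1, self.STRONG_NOT)
```

The 2-bit counter means a single anomalous outcome (for example, a loop exit) does not immediately flip a strongly-taken branch to predicted not-taken.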
Processing the branch instruction and each following instruction then proceeds down the pipeline for several clock cycles until the branch instruction has completed the execution stage, after which the “takeness” of the branch is known. If the branch is taken, the actual branch target address of the branch will be known. If the branch has been correctly predicted, then execution will continue in accordance with the prediction. However, if the branch has been mispredicted, then the pipeline is flushed and the correct instruction is inserted into the pipeline. In a superscalar computer, which has two or more pipelines through which instructions flow side-by-side, the performance penalty on a misprediction is even greater because, in most cases, at least twice the number of instructions may need to be flushed.
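The cost trade-off in the two paragraphs above reduces to simple arithmetic. This is a back-of-the-envelope sketch with illustrative numbers (the 80% accuracy figure comes from the text; the function names and the 2-wide machine are assumptions): the average cycles lost per branch is the misprediction rate times the flush cost, and a superscalar machine discards roughly issue-width times as many instruction slots per flushed cycle.

```python
def avg_branch_penalty(mispredict_rate, flush_cycles):
    """Expected cycles lost per branch when only mispredictions pay."""
    return mispredict_rate * flush_cycles

def flushed_slots(flush_cycles, issue_width):
    """Instruction slots discarded on one flush of a superscalar machine."""
    return flush_cycles * issue_width

# With the text's 80%-correct predictor and a 3-cycle flush:
penalty = avg_branch_penalty(0.20, 3)  # average cycles lost per branch
lost = flushed_slots(3, 2)             # a 2-wide machine discards 6 slots
```

This is why the misprediction penalty grows with both pipeline depth (more flushed cycles) and issue width (more instructions per flushed cycle).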
As the instruction issue rate and pipeline depth of processors increase, the accuracy of branch prediction becomes an increasingly significant factor in performance. Many schemes have been developed for improving the accuracy of branch predictions. These schemes may be classified broadly as either static or dynamic. Static schemes use branch opcode information and profiling statistics from executions of the program to make predictions. Static prediction s
Cummins Sean P.
Munson Kenneth K.
Oblon & Spivak, McClelland, Maier & Neustadt P.C.
Rise Technology Company
Treat William M.