Incorporating local branch history when predicting multiple...

Electrical computers and digital processing systems: processing – Processing control – Branching

Reexamination Certificate

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

C712S239000

Reexamination Certificate

active

06738897

ABSTRACT:

BACKGROUND OF THE INVENTION
Computer processors contain arithmetic, logic, and control circuitry that interpret and execute instructions from a computer program. Referring to
FIG. 1
, a typical computer system includes a microprocessor (
22
) having, among other things, a CPU (
24
), a memory controller (
26
), and an on-board cache memory (
30
). The microprocessor (
22
) is connected to external cache memory (
32
) and a main memory (
34
) that both hold data and program instructions to be executed by the microprocessor (
22
). Internally, the execution of program instructions is carried out by the CPU (
24
). Data needed by the CPU (
24
) to carry out an instruction are fetched by the memory controller (
26
) and loaded into internal registers (
28
) of the CPU (
24
). Upon command from the CPU (
24
), the memory controller (
26
) searches for the data first in the fast on-board cache memory (
30
), then in external cache memory (
32
), and finally in the slow main memory (
34
). Finding the data in the cache memory is referred to as a “hit.” Not finding the data in the cache memory is referred to as a “miss.”
The time between when a CPU requests data and when the data is retrieved and available for use by the CPU is termed the “latency” of the system. If requested data is found in cache memory, i.e., a data hit occurs, the requested data can be accessed at the speed of the cache and the latency of the system is reduced. If, on the other hand, the data is not found in cache, i.e., a data miss occurs, and thus the data must be retrieved from main memory for access and the latency of the system is increased.
In the pursuit of improving processor performance, designers have sought two main goals: making operations faster and executing more operations in parallel. Making operations faster can be approached in several ways. For example, transistors can be made to switch faster and thus propagate signals faster by improving semiconductor processes; execution-unit latency can be reduced by increasing the number of transistors in the design; and the levels of logic required by the design to implement a given function can be minimized to increase speed. To execute more operations in parallel, designers mainly rely on one, or a combination of pipelining and superscalar techniques. Pipelined processors overlap instructions in time on common execution resources. Superscalar processors overlap instructions in space on separate resources.
Pipeline stalls are a main performance inhibitor with regard to parallel processing. Stalls arise from data dependencies, changes in program flow, and hardware resource conflicts. At times, pipeline stalls can be avoided by rearranging the order of execution for a set of instructions. Compilers can be used to statically reschedule instructions. However, incomplete knowledge of run-time information reduces the effectiveness of static rescheduling. In-order processors, i.e., processors that issue, execute, complete, and retire instructions in strict program order, have to rely entirely on static rescheduling and thus are prone to pipeline stalls.
As a result, designers use out-of-order processors and seek to implement dynamic instruction rescheduling. The simplest out-of-order processors issue instructions in order but allow them to execute and complete out of order. Even these simple out-of-order processors require complex hardware to reorder results before the corresponding instructions are retired. A strict result order is not required from a data-flow perspective. However, such ordering is necessary to maintain precise exceptions and to recover from mispredicted speculative execution.
A well-known method of reordering is through the use of a reorder buffer, i.e., a buffer that maintains results until written to the register file in program order. Designers also use other types of reordering hardware, such as history buffers and future files. History buffers record source-operand history so the processor can backtrack to a precise architectural state and future files store the current state and the architectural state in separate register files allowing the processor to be restored to a precise checkpoint state.
Branch prediction and speculative execution are additional techniques used to reduce pipeline stalls. In a pipelined processor, the outcomes of branch instructions are often determined after subsequent instructions have been fetched. Using branch prediction schemes, microprocessors attempt to accurately predict whether or not a branch is taken based on how that branch has behaved previously. The aggregate behavior, or the average behavior over time, of the branch instruction is stored in a Branch Prediction Table (“BPT”). Given a branch instruction's aggregate behavior, the branch predictor, which resides in an instruction fetch unit, predicts the outcome of the branch instruction and then fetches instructions thereafter based on that prediction. For example, if the branch predictor predicts that a branch will be taken, then the processor fetches the instructions at the instruction address to which the predicted instruction branches. When the branch proceeds in the predicted direction, pipeline stalls are completely avoided. On the other hand, if the branch direction is mispredicted, all the instructions after the mispredicted instruction must be removed from the processor.
Modern microprocessors incorporate a variety of branch prediction schemes. These schemes usually fall under one of two broad classifications: static branch prediction and dynamic branch prediction. Static branch prediction occurs when a branch predictor makes predictions that are not based on the run-time behavior of branch instructions. Two such schemes are: “predict not taken” and “predict taken.” In the “predict not taken” scheme, a branch is predicted as not taken, and the processor simply continues as if the branch did not exist. In the “predict taken” scheme, as soon as the branch is decoded and the target address of the next instruction is determined,. it is assumed that the branch is taken and the process continues with the fetching and executing of instructions at the target address.
Dynamic branch prediction, on the other hand, occurs when the processor responds to changes in a branch instruction's behavior while a program is executing. In other words, a dynamic branch prediction scheme provides a mechanism by which a processor can take into account the cumulative behavior of a branch instruction. In cases where there are more than one branch instruction in an instruction bundle, also known as a fetch bundle, some schemes may break the fetch bundle at the points where additional branch instructions reside.
Typically, dynamic branch prediction schemes are extended to predict multiple branches. The increasing number of instructions executed per cycle in high-performance microprocessors calls for the instruction fetch unit to fetch a correspondingly larger number of instructions each cycle. Fetching more instructions per cycle increases the likelihood that more than one branch prediction will need to be made each cycle to determine the next fetch address. Accordingly, a microprocessor handles multiple branch instructions through a multiple branch prediction scheme.
Multiple branch prediction schemes usually depend upon a fetch bundle address (FBA), i.e., the address of the first instruction fetched in a given cycle, to index all prediction structures for the fetched instruction bundle. Each branch to be predicted in the current fetch bundle uses information from the prediction structures indexed by the address of the bundle currently being fetched.
An index based upon the fetch bundle address has typically been used to select several prediction counters from which multiple branch predictions are made dependent upon the location of the branches in the fetch bundle. These counters may be simple two-bit counters with saturation. Typically, the counter is incremented when the branch instruction is taken and it is decremented when the instruction is not taken.

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

Incorporating local branch history when predicting multiple... does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with Incorporating local branch history when predicting multiple..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Incorporating local branch history when predicting multiple... will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-3199132

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.