Title: Decoupled fetch-execute engine with static branch prediction support
Patent Number: 06523110
Type: Reexamination Certificate
Filed: 1999-07-23
Issued: 2003-02-18
Examiner: Coleman, Eric (Department: 2183)
Classification: Electrical computers and digital processing systems: processing – Processing control – Branching
U.S. Class: C712S235000
Status: active
BACKGROUND
1. Technical Field
The present invention relates generally to computer processing systems and, in particular, to a decoupled fetch-execute engine with static branch prediction support.
2. Background Description
Early microprocessors generally processed instructions one at a time. Each instruction was processed using four sequential stages: instruction fetch; instruction decode; instruction execute; and result writeback. Within such microprocessors, different dedicated logic blocks performed each different processing stage. Each logic block waited until all the preceding logic blocks completed operations before beginning its operation.
Improved computational speed has been obtained by increasing the speed with which the computer hardware operates and by introducing parallel processing in one form or another. “Superscalar” and “Very Long Instruction Word” (VLIW) microprocessors have recently been introduced to implement parallel processing. They have multiple execution units (e.g., multiple integer arithmetic logic units (ALUs)) for executing instructions and thus have multiple “pipelines”. As such, multiple machine instructions may be executed simultaneously in a superscalar or VLIW microprocessor, providing obvious benefits in the overall performance of the device and its system application.
For the purposes of this discussion, latency is defined as the delay between the fetch stage of an instruction and the execution stage of the instruction. Consider an instruction which references data stored in a specified register. Such an instruction requires at least four machine cycles to complete. In the first cycle, the instruction is fetched from memory. In the second cycle, the instruction is decoded. In the third cycle, the instruction is executed and, in the fourth cycle, data is written back to the appropriate location.
To improve efficiency and reduce instruction latency, microprocessor designers overlapped the operations of the fetch, decode, execute, and writeback logic stages such that the microprocessor operated on several instructions simultaneously. In operation, the fetch, decode, execute, and writeback logic stages concurrently process different instructions. At each clock pulse the result of each processing stage is passed to the subsequent processing stage. Microprocessors that use the technique of overlapping the fetch, decode, execute, and writeback stages are known as “pipelined” microprocessors. In principle, a pipelined microprocessor can complete the execution of one instruction per machine cycle when a known sequence of instructions is being executed. Thus, it is evident that the effects of the latency time are reduced in pipelined microprocessors by initiating the processing of a second instruction before the actual execution of the first instruction is completed.
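As a rough illustration (not part of the patent text), the cycle counts implied by this overlap can be tallied directly. The sketch below compares a machine that processes instructions one at a time against an ideal four-stage pipeline; the instruction count is an arbitrary assumption:

    #include <stdio.h>

    /* Illustrative arithmetic only: with the classic four stages
     * (fetch, decode, execute, writeback), a non-pipelined machine
     * spends STAGES cycles per instruction, while an ideal pipeline
     * completes one instruction per cycle once the pipeline fills. */
    #define STAGES 4

    int main(void) {
        unsigned long n = 1000;                       /* hypothetical instruction count */
        unsigned long serial    = STAGES * n;         /* one instruction at a time */
        unsigned long pipelined = n + (STAGES - 1);   /* fill once, then one per cycle */

        printf("serial:    %lu cycles (CPI = %.2f)\n", serial, (double)serial / n);
        printf("pipelined: %lu cycles (CPI = %.2f)\n", pipelined, (double)pipelined / n);
        return 0;
    }

For 1000 instructions this prints 4000 cycles (CPI 4.00) for the serial machine versus 1003 cycles (CPI 1.00) for the pipeline, which is the reduction in effective latency described above.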
Unfortunately, various scenarios may exist where a stall is induced in a pipelined microprocessor. One such scenario is the branch instruction. In general, the instruction flow in a microprocessor requires that instructions are fetched and decoded from sequential locations in memory. A branch instruction is an instruction that causes a disruption in this flow, e.g., a taken branch causes decoding to be discontinued along the sequential path, and resumed at a new location in memory. The new location in memory may be referred to as a target address of the branch. Such an interruption in pipelined instruction flow results in a substantial degradation in pipeline performance.
As architecture and compiler designers continue to strive for greater degrees of parallelism, the effect of pipeline stall penalties on parallelism becomes very significant. For high levels of parallelism, the average number of cycles spent executing an instruction (CPI) must be much less than 1. Such a small CPI is only possible by minimizing the CPI penalties from stalls, thereby reducing their impact upon pipeline throughput. The problem of reducing stall penalties is aggravated by the potentially greater frequency of stalls due to higher instruction issue rates. It becomes necessary to find more capable methods for decreasing these penalties. Two common methods for reducing stall penalties include decoupled architectures and branch prediction.
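The arithmetic behind this claim can be made concrete: effective CPI is the base CPI plus a stall term. Every number in the sketch below is a hypothetical assumption, chosen only to show how quickly stall penalties erode a sub-1 CPI:

    #include <stdio.h>

    /* Effective CPI = base CPI + (stall frequency * stall penalty).
     * All values are assumptions for illustration, not measurements. */
    int main(void) {
        double base_cpi    = 0.25;  /* ideal CPI of a 4-issue machine (assumed) */
        double branch_freq = 0.20;  /* fraction of instructions that are branches (assumed) */
        double mispredict  = 0.10;  /* fraction of branches mispredicted (assumed) */
        double penalty     = 10.0;  /* cycles lost per misprediction (assumed) */

        double effective = base_cpi + branch_freq * mispredict * penalty;
        printf("effective CPI = %.2f\n", effective);  /* 0.25 + 0.20*0.10*10 = 0.45 */
        return 0;
    }

Here a modest stall term nearly doubles the effective CPI, which is why higher issue rates make stall reduction more, not less, urgent.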
Decoupled architectures use buffering and control mechanisms to dissociate memory accesses from the rest of the pipeline. When a cache miss occurs, the decoupled architecture allows the rest of the pipeline to continue moving forward, stalling only those instructions dependent upon that cache access. Decoupling of cache accesses from the pipeline can be used with either instruction or data caches. Decoupling of the instruction cache from the execute pipeline, hereafter referred to as decoupled fetch-execute, is beneficial for both superscalar and EPIC (Explicitly Parallel Instruction Computing)/VLIW architectures. The EPIC architecture is further described by L. Gwennap, in “Intel, HP make EPIC disclosure”, Microprocessor Report, 11(14):1-9, Oct. 1997.
Decoupled fetch-execute architectures use instruction buffers and branch prediction to enable instruction fetching to be independent from the rest of the pipeline. The instruction buffers are organized as a queue that receives instructions as they are fetched from the instruction cache. As instructions enter the queue, a branch prediction mechanism checks for the existence of a branch instruction. When a branch is found, the branch prediction mechanism predicts the likely branch target (address) and direction. If necessary, the branch prediction mechanism redirects the instruction fetch to the predicted address.
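A minimal sketch of such a decoupled fetch queue follows. All names (FetchQueue, fetch_step, icache_read, and the is_branch/predict_taken stubs) are hypothetical illustrations of the mechanism just described, not the patent's implementation:

    #include <stdbool.h>
    #include <stddef.h>

    #define QUEUE_DEPTH 16              /* hypothetical buffer depth */

    typedef struct {
        unsigned pc;                    /* address the instruction was fetched from */
        unsigned word;                  /* raw instruction bits */
    } FetchEntry;

    typedef struct {
        FetchEntry slot[QUEUE_DEPTH];
        size_t head, tail, count;       /* circular queue between fetch and execute */
    } FetchQueue;

    static bool is_branch(unsigned word) {
        return (word >> 26) == 0x12;    /* hypothetical branch opcode field */
    }

    static bool predict_taken(unsigned pc, unsigned *target) {
        (void)pc;
        *target = 0;                    /* stub: a real front end consults its predictor here */
        return false;
    }

    /* Fetch one instruction into the queue; on a predicted-taken branch,
     * return the predicted target so fetching is redirected immediately,
     * without waiting for the execute stage to resolve the branch. */
    static unsigned fetch_step(FetchQueue *q, unsigned next_pc,
                               unsigned (*icache_read)(unsigned)) {
        if (q->count == QUEUE_DEPTH)
            return next_pc;             /* queue full: fetch waits, execute keeps draining */

        unsigned word = icache_read(next_pc);
        q->slot[q->tail] = (FetchEntry){ next_pc, word };
        q->tail = (q->tail + 1) % QUEUE_DEPTH;
        q->count++;

        unsigned target;
        if (is_branch(word) && predict_taken(next_pc, &target))
            return target;              /* redirect fetch down the predicted path */
        return next_pc + 4;             /* otherwise continue sequentially */
    }

    /* The execute side consumes from the other end, stalling only
     * when the queue runs dry (e.g., after an instruction cache miss). */
    static bool execute_pop(FetchQueue *q, FetchEntry *out) {
        if (q->count == 0)
            return false;
        *out = q->slot[q->head];
        q->head = (q->head + 1) % QUEUE_DEPTH;
        q->count--;
        return true;
    }

The queue is what provides the decoupling: fetch_step and execute_pop advance independently, interacting only through the buffer's occupancy.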
Most general-purpose processors today use dynamic branch prediction mechanisms, which select at execution time the direction a branch is expected to take. Dynamic branch prediction mechanisms can include tables of prediction counters, history tables, and branch target buffers. Many of these schemes add considerable hardware and may affect the processor frequency. Dynamic branch prediction schemes are described by: Calder et al., in “A System Level Perspective on Branch Architecture Performance”, Proceedings of the 16th Annual International Symposium on Computer Architecture, pp. 199-206, May 1989; and Chang et al., in “Comparing software and hardware schemes for reducing the cost of branches”, Proceedings of the 16th Annual International Symposium on Computer Architecture, pp. 224-233, May 1989.
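Of the structures listed above, a table of 2-bit saturating prediction counters is the simplest to sketch. The table size, indexing, and demonstration branch below are assumptions for illustration:

    #include <stdio.h>

    /* One classic dynamic scheme: 2-bit saturating counters indexed by
     * branch address. States 0-1 predict not-taken, 2-3 predict taken;
     * each resolved outcome nudges the counter one step. */
    #define TABLE_SIZE 1024                         /* hypothetical table size */

    static unsigned char counters[TABLE_SIZE];      /* all start at 0: strongly not-taken */

    static int predict(unsigned pc) {
        return counters[(pc >> 2) % TABLE_SIZE] >= 2;   /* taken if in the upper half */
    }

    static void update(unsigned pc, int taken) {
        unsigned char *c = &counters[(pc >> 2) % TABLE_SIZE];
        if (taken  && *c < 3) (*c)++;               /* saturate at 3 (strongly taken) */
        if (!taken && *c > 0) (*c)--;               /* saturate at 0 (strongly not-taken) */
    }

    int main(void) {
        /* A loop-closing branch taken 9 times then falling through:
         * the counter warms up and predicts "taken" from the third try on. */
        unsigned pc = 0x4000;
        for (int i = 0; i < 10; i++) {
            int taken = (i < 9);
            printf("iter %d: predict=%d actual=%d\n", i, predict(pc), taken);
            update(pc, taken);
        }
        return 0;
    }

The counter table is the hardware cost the passage refers to: prediction state must be read every fetch, which is why large tables can pressure cycle time.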
Static branch prediction provides an alternate prediction method, as it corresponds to selecting at compile time the direction a branch is expected to take. Static branch prediction does not perform as well as dynamic branch prediction for most general-purpose applications, but it does well in some application markets, so architectures for those markets may be able to forgo the cost of dynamic branch prediction. Such markets include media processing and binary translation in software, which performs run-time compilation using dynamic profile statistics, enabling accurate static branch prediction. Media processing is further described by Fritts et al., in “Understanding multimedia application characteristics for designing programmable media processors”, SPIE Photonics West, Media Processors '99, San Jose, Calif., January 1999. Binary translation in software is described by Altman et al., in “Execution-based Scheduling for VLIW Architectures”, submitted to Euro-Par '99, Toulouse, France, September 1999.
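As one concrete illustration of selecting a direction at compile time (an illustration of the general idea, not the mechanism claimed here), GCC and Clang expose __builtin_expect, which lets source code, or a profile-guided compiler, record the expected direction of a branch:

    #include <stddef.h>

    /* likely()/unlikely() wrap GCC/Clang's __builtin_expect, recording at
     * compile time which way each branch is expected to go. */
    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    /* Hypothetical routine: the hints steer code layout so the common
     * case falls through sequentially and the rare error path is
     * placed out of line. */
    int sum_positive(const int *v, size_t n) {
        int sum = 0;
        for (size_t i = 0; i < n; i++) {
            if (unlikely(v[i] < 0))
                return -1;          /* rare path: compiler assumes not taken */
            sum += v[i];            /* common path: laid out on the fall-through */
        }
        return sum;
    }

With profile feedback, as in the binary-translation scenario above, the compiler can derive these expected directions automatically rather than from source annotations.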
In static branch prediction, conditional branches that are predicted as not taken, i.e., those expected to fall through to the sequential path, are easily supported since instruction fetch logic automatically continues sequentially. Unconditional branches and conditional branches that are predicted as taken, i.e., those expected to continue execution at a non-sequential target instruction, require support for redirecting the instruction fetch unit to begin prefetching the expected branch target prior to execution of the branch. It is desired that this prefetching begin immediately after the fetch of the branch instruction to enable …
Inventors: Bright, Arthur A.; Fritts, Jason E.
Primary Examiner: Coleman, Eric
Attorney, Agent or Firm: F. Chau & Associates LLP
Assignee: International Business Machines Corporation