Apparatus and method for improving superscalar processors

Patent number: 06311261
Type: Reexamination Certificate
Status: active
Filed: 1997-09-15
Issued: 2001-10-30
Examiner: Eng, David Y. (Department: 2155)
Classification: Electrical computers and digital processing systems: processing – Processing architecture – Superscalar
ABSTRACT:
BACKGROUND OF THE INVENTION

1. FIELD OF THE INVENTION
This invention relates generally to an apparatus and a method for improving processor microarchitecture in superscalar microprocessors. In particular, the invention relates to an apparatus and a method for a modified reorder buffer and a distributed instruction queue that increase efficiency by reducing hardware complexity, execution time, and the number of global wires in superscalar microprocessors that support multi-instruction issue, decoupled dataflow scheduling, out-of-order execution, register renaming, multi-level speculative execution, load bypassing, and precise interrupts.
2. Background of the Related Art
The main driving force in the research and development of microprocessor architectures is improving performance per unit cost. The true measure of performance is the time (seconds) required to execute a program. The execution time of a program is basically determined by three factors (see Patterson and Hennessy, Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers, 1990): the number of instructions executed in the program (dynamic Inst_Count), the average number of clock cycles per instruction (CPI), and the processing cycle time (Clock_Period), or

T_program = Inst_Count × CPI × Clock_Period.  (1)
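As a concrete illustration of equation (1), the following C sketch computes T_program from the three factors; the numeric values are illustrative assumptions only, not figures taken from the text.

    #include <stdio.h>

    /* Equation (1): T_program = Inst_Count x CPI x Clock_Period.
     * The values below are illustrative assumptions. */
    int main(void)
    {
        double inst_count   = 2.0e9;   /* dynamic instructions executed        */
        double cpi          = 1.5;     /* average clock cycles per instruction */
        double clock_period = 5.0e-9;  /* seconds per cycle (200 MHz clock)    */

        double t_program = inst_count * cpi * clock_period;
        printf("T_program = %.1f seconds\n", t_program);  /* prints 15.0 */
        return 0;
    }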
To improve performance (reduce execution time), it is necessary to reduce one or more of these factors. The most obvious one to reduce is Clock_Period, by means of semiconductor/VLSI technology improvements such as device scaling, faster circuit structures, and better routing techniques. A second approach to performance improvement is architecture design. CISC and VLIW architectures take the approach of reducing Inst_Count. RISC and superscalar architectures attempt to reduce the CPI. Superpipelined architectures increase the degree of pipelining to reduce the Clock_Period.
The true measure of cost is dollars/unit to implement and manufacture a microprocessor design in silicon. This hardware cost is driven by many factors such as die size, die yield, wafer cost, die testing cost, packaging cost, etc. The architectural choices made in a microprocessor design affect all these factors.
It is desirable to focus on finding microarchitecture techniques/alternatives to improve the design of superscalar microprocessors. The term microprocessor refers to a processor or CPU that is implemented in one or a small number of semiconductor chips. The term superscalar refers to a microprocessor implementation that increases performance by concurrent execution of scalar instructions, the type of instructions typically found in general-purpose microprocessors. It should be understood that hereinafter, the term “processor” also means “microprocessor”.
A superscalar architecture can be generalized as a processor architecture that fetches and decodes multiple scalar instructions from a sequential, single-flow instruction stream, and executes them concurrently on different functional units. In general, there are seven basic processing steps in superscalar architectures: fetch, decode, dispatch, issue, execute, writeback, and retire. FIG. 1 illustrates these basic steps.
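For concreteness, the seven steps can be written as an ordered enumeration in C; the enum names and the single-instruction walk below are a conceptual sketch only, not a model of FIG. 1 itself.

    #include <stdio.h>

    /* The seven basic processing steps, as an ordered enumeration.
     * Names follow the text; the walk below is purely conceptual. */
    enum step { FETCH, DECODE, DISPATCH, ISSUE, EXECUTE, WRITEBACK, RETIRE, N_STEPS };

    static const char *step_name[N_STEPS] = {
        "fetch", "decode", "dispatch", "issue", "execute", "writeback", "retire"
    };

    int main(void)
    {
        /* Walk a single instruction through every step in program order. */
        for (int s = FETCH; s < N_STEPS; s++)
            printf("step %d: %s\n", s + 1, step_name[s]);
        return 0;
    }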
First, multiple scalar instructions are fetched simultaneously from an instruction cache/memory or other storage unit. Current state-of-the-art superscalar microprocessors fetch two or four instructions simultaneously. Valid fetched instructions (the ones that are not after a branch-taken instruction) are decoded concurrently, and dispatched into a central instruction window (FIG. 1a) or distributed instruction queues or windows (FIG. 1b). Shelving of these instructions is necessary because some instructions cannot execute immediately, and must wait until their data dependencies and/or resource conflicts are resolved. After an instruction is ready it is issued to the appropriate functional unit. Multiple ready instructions are issued simultaneously, achieving parallel execution within the processor. Execution results are written back to a result buffer first. Because instructions can complete out-of-order and speculatively, results must be retired to register file(s) in the original, sequential program order. An instruction and its result can retire safely if it completes without an exception and there are no exceptions or unresolved conditional branches in the preceding instructions. Memory stores wait at a store buffer until they can commit safely.
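The in-order retire rule described above can be sketched as a check on the head of a circular result buffer; the entry fields and buffer size below are assumptions made for illustration, not the modified reorder buffer claimed by the invention.

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical result-buffer entry; field names and size are assumptions. */
    struct rb_entry {
        bool complete;          /* execution finished, result written back */
        bool exception;         /* completed with an exception             */
        bool unresolved_branch; /* conditional branch not yet resolved     */
        int  result;            /* value to commit to the register file    */
    };

    #define RB_SIZE 8
    static struct rb_entry rb[RB_SIZE];
    static int head, count;

    /* Retire entries from the head in original program order.  An entry may
     * retire only if it completed without an exception and is not an
     * unresolved branch; a blocked head makes all younger entries wait. */
    static void retire_ready(void)
    {
        while (count > 0) {
            struct rb_entry *e = &rb[head];
            if (!e->complete || e->exception || e->unresolved_branch)
                break;                    /* head blocks retirement this cycle */
            printf("commit result %d to the register file\n", e->result);
            head = (head + 1) % RB_SIZE;  /* advance the circular buffer */
            count--;
        }
    }

    int main(void)
    {
        /* Two completed entries followed by an unresolved conditional branch:
         * only the first two retire this cycle. */
        rb[0] = (struct rb_entry){ .complete = true, .result = 7 };
        rb[1] = (struct rb_entry){ .complete = true, .result = 9 };
        rb[2] = (struct rb_entry){ .unresolved_branch = true };
        count = 3;
        retire_ready();
        return 0;
    }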
The parallel executions in superscalar processors demand high memory bandwidth for instructions and data. Efficient instruction bandwidth can be achieved by aligning and merging the decode group. Branching causes wasted decoder slots on the left side (due to unaligned branch target addresses) and on the right side (due to a branch-taken instruction that is not at the end slot). Aligning shifts branch target instructions to the leftmost slot to utilize all decoder slots. Merging fills the slots to the right of a branch-taken instruction with the branch target instructions, combining different instruction runs into one dynamic instruction stream. Efficient data bandwidth can be achieved by load bypassing and load forwarding (M. Johnson, Superscalar Microprocessor Design, Prentice-Hall, 1991), which implement a relaxed or weak memory-ordering model. Relaxed ordering allows an out-of-order sequence of reads and writes, to optimize the use of the data bus. Stores to memory cannot commit until they are safe (retire step). Forcing loads and stores to commence in order would delay loads significantly and stall other instructions that wait on the load data. Load bypassing allows a load to bypass stores in front of it (out-of-order execution), provided there is no read-after-write hazard. Load forwarding allows a load to be satisfied directly from the store buffer when there is a read-after-write dependency. Executing loads early is safe because load data is not written directly to the register file.
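Load bypassing and load forwarding can be sketched as a scan of the store buffer before a load executes; the store-buffer layout and function names below are assumptions made for this sketch.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical store-buffer entry holding a pending (uncommitted) store. */
    struct sb_entry {
        bool     valid;
        uint32_t addr;
        uint32_t data;
    };

    #define SB_SIZE 4
    static struct sb_entry store_buf[SB_SIZE];

    /* Execute a load ahead of pending stores (load bypassing).  If an older
     * store to the same address is still in the buffer, forward its data
     * (load forwarding) instead of waiting for it to commit. */
    static uint32_t try_load(uint32_t addr)
    {
        /* Scan from youngest (highest index) to oldest pending store. */
        for (int i = SB_SIZE - 1; i >= 0; i--) {
            if (store_buf[i].valid && store_buf[i].addr == addr)
                return store_buf[i].data;  /* read-after-write: forward from buffer */
        }
        /* No conflicting pending store: the load bypasses the stores.  A real
         * design would also stall while an older store address is unknown. */
        return 0;  /* placeholder for a data-cache read */
    }

    int main(void)
    {
        store_buf[1] = (struct sb_entry){ .valid = true, .addr = 0x100, .data = 42 };
        printf("load 0x100 -> %u\n", (unsigned)try_load(0x100));  /* forwarded: 42     */
        printf("load 0x200 -> %u\n", (unsigned)try_load(0x200));  /* bypasses stores: 0 */
        return 0;
    }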
Classic superscalar architectures accomplish fine-grain parallel processing at the instruction level, which is limited to a single flow of control. They cannot execute independent regions of code concurrently (multiple flows of control). Externally, the instruction stream of a superscalar processor appears the same as in a CISC or RISC uniprocessor: a sequential, single-flow instruction stream. It is only internally that instructions are distributed to multiple processing units. There are complexities and limitations involved in parallelizing a sequential, single-flow instruction stream. The following six superscalar features (multi-instruction issue, decoupled dataflow scheduling, out-of-order execution, register renaming, speculative execution, and precise interrupts) are key to achieving this goal. They help improve performance and ensure correctness in superscalar processors.
Multi-instruction issue is made possible by widening a conventional, serial processing pipeline in the “horizontal” direction to have multiple pipeline streams. In this manner multiple instructions can be issued simultaneously per clock cycle. Thus, superscalar microprocessors must have multiple execution/functional units with independent pipeline streams. Also, to be able to sustain multi-instruction issue at every cycle, superscalar microprocessors fetch and decode multiple instructions at a time.
Decoupled dataflow scheduling is supported by buffering all decoded instructions in an instruction window (or windows) before they are scheduled for execution. The instruction window(s) essentially "decouples" the decode and execute stages. There are two primary objectives. The first is to maintain the flow of instruction fetching and decoding by not forcing the decoded instructions to be scheduled right away. This reduces unnecessary stalls. Instructions are allowed to take time to resolve data dependencies and/or resource conflicts. The second is to improve the look-ahead capability of the processor. With the instruction window, a processor is able to look ahead beyond the stalled instructions to discover others that are ready to execute. The issue
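Decoupled dataflow scheduling with look-ahead can be sketched as a per-cycle scan of the instruction window that issues any ready instructions to their functional units while skipping stalled ones (multi-instruction issue); the entry fields, window size, and issue width below are illustrative assumptions.

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical instruction-window entry for this sketch. */
    struct win_entry {
        bool valid;
        bool ready;  /* operands available, no resource conflict */
        int  fu;     /* index of the functional unit it needs    */
    };

    #define WIN_SIZE    8
    #define ISSUE_WIDTH 4  /* up to four instructions issued per cycle */

    /* Scan the whole window each cycle and issue any ready instructions,
     * looking past stalled (not-ready) entries. */
    static void issue_cycle(struct win_entry *win)
    {
        int issued = 0;
        for (int i = 0; i < WIN_SIZE && issued < ISSUE_WIDTH; i++) {
            if (win[i].valid && win[i].ready) {
                printf("issue window slot %d to functional unit %d\n", i, win[i].fu);
                win[i].valid = false;  /* slot freed for a newly dispatched instruction */
                issued++;
            }
        }
    }

    int main(void)
    {
        struct win_entry win[WIN_SIZE] = {
            { true, false, 0 },  /* stalled: waiting on a data dependency   */
            { true, true,  1 },  /* ready: issued despite the stall above   */
            { true, true,  2 },  /* ready: issued in the same cycle         */
        };
        issue_cycle(win);
        return 0;
    }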
Inventors: Alford, Cecil O.; Chamdani, Joseph I.
Examiner: Eng, David Y.
Assignee: Georgia Tech Research Corporation
Attorney/Agent: Thomas Kayden Horstemeyer & Risley LLP