Electrical computers and digital processing systems: memory – Storage accessing and control – Hierarchical memories
Reexamination Certificate
2000-03-31
2002-02-19
Kim, Matthew (Department: 2186)
Electrical computers and digital processing systems: memory
Storage accessing and control
Hierarchical memories
C711S124000, C711S145000, C711S146000
Reexamination Certificate
active
06349361
ABSTRACT:
BACKGROUND
1. Technical Field
The present invention generally relates to computer processing systems and, in particular, to methods and apparatus for reordering and renaming memory references in a multiprocessor computer system.
2. Background Description
Contemporary high-performance processors rely on superscalar, superpipelining, and/or very long instruction word (VLIW) techniques for exploiting instruction-level parallelism in programs (i.e., for executing more than one instruction at a time). In general, these processors contain multiple functional units, execute a sequential stream of instructions, are able to fetch from memory more than one instruction per cycle, and are able to dispatch for execution more than one instruction per cycle subject to dependencies and availability of resources.
The pool of instructions from which the processor selects those that are dispatched at a given point in time is enlarged by the use of out-of-order execution. Out-of-order execution is a technique by which the operations in a sequential stream of instructions are reordered so that operations appearing later are executed earlier, if the resources required by the later appearing operations are free. Thus, out-of-order execution reduces the overall execution time of a program by exploiting the availability of the multiple functional units and using resources that would otherwise be idle. Reordering the execution of operations requires reordering the results produced by those operations, so that the functional behavior of the program is the same as what would be obtained if the instructions were executed in their original sequential order.
In the case of memory-related operations, a memory store operation stores a datum in memory. A later memory load operation may read this datum from memory, load the datum into a processor register and, as is frequently the case, start a sequence of operations that depend on the datum. When directly bypassing such values from the store operation to a subsequent load operation, a slow main memory access may be substituted by a faster register-to-register access. In addition to using idle resources, the bypassing of such values reduces the critical path (i.e., the sequence of operations which determines the minimum execution time possible for a given code fragment) and reduces the number of memory operations which must be processed by the memory system. An additional performance improvement can be achieved by speculatively executing store operations out-of-order. Other benefits are the ability to reorder multiple store and load references to the same memory location by using a technique referred to as “renaming of memory locations”.
In general, there are two basic approaches to implementing out-of-order execution and reordering of results: dynamic reordering and static reordering. In dynamic reordering, the instructions are analyzed at execution time, and the instructions and results are reordered in hardware. In static reordering, a compiler/programmer analyzes and reorders the instructions and the results produced by those instructions when the program is generated, thus the reordering tasks are accomplished through software. These two approaches can be jointly implemented.
To ensure that such operations are performed correctly, there must exist a mechanism to undo speculative memory references. Furthermore, store operations to the same address must be presented to the main memory in original program order.
In a multiprocessor environment, additional restrictions are posed on the ordering of memory operations. To achieve predictable and repeatable computation of programs in a multiprocessor environment, a requirement of ‘sequential consistency’ is described in the article by L. Lamport, “How to Make a Multiprocessor that Correctly Executes Multiprocess Programs”, IEEE Transactions on Computers, C-28(9), pp. 690-91 (September 1979). The article by Lamport defines a multiprocessor system as sequentially consistent if “the result of any execution is the same as if the operations of all processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program”. For static speculative execution, the order of the original logical program text is authoritative, not the reordered program text, and the compiler and hardware implementation must collaborate to generate an execution equivalent to that original order.
To achieve proper performance while simplifying coherence protocols between multiple processors in a computer system, several relaxations of the above described strictly sequential consistent order are possible. The types of re-ordering which are allowable depend on the memory consistency model guaranteed by a particular implementation. An overview of current and proposed consistency models and their characteristics is provided in the article by S. Adve and K. Gharachorloo, “Shared Memory Consistency Models: A Tutorial”, Technical Report 9512, Dept. of Electrical and Computer Engineering, Rice University, Houston, Tex. (September 1995).
Typically, these requirements impose additional restrictions on the order of multiple store operations (even those store operations referring to different addresses), and of load and store operations (executed by the same processor or different processors, and referring to the same address or different addresses) with respect to each other.
While these requirements guarantee the correct operation of programs designed to work in the context of such coherency protocols, they impose limitations on the order of operation as instructions are executed in the processor. To achieve high performance while adhering to processor consistency models, a processor must be able to reorder memory operations internally and bypass results between them, but present the memory operations to the memory system in-order.
Accordingly, it would desirable and highly advantageous to have support for the following features in a high performance memory interface of a multiprocessor computer system implementing out-of-order execution, so as to provide maximum scheduling freedom:
1. The ability to execute store operations out-of-order, but retire them to memory in-order.
2. The ability to speculatively perform store operations, coupled with the ability to undo such store operations transparently (i.e., without influencing the correctness of program execution in a multiprocessor system).
3. The ability to hold multiple store result values for the same memory address, and resolve load references to these values, while at the same time retiring store values in original program order to the memory system.
Some example code sequences will now be given to illustrate the performance impact of implementing the above features in a processor supporting out-of-order execution of store operations.
With respect to the ability to execute store operations out-of-order with respect to each other, consider the following in-order code fragment, where the latency of a multiply (MUL) operation is three cycles, and that of all other operations is 1 cycle.
MUL
r3 = r4, r5
ST
r3, 16(fp)
LI
r4 = 0
ST
r4, 20(fp)
The preceding code fragment will execute on a single issue out-of-order processor without out-of-order execution of store operations in 5 cycles as follows:
MUL
r3 = r4, r5
LI
r4 = 0
NOP
ST
r3, 16(fp)
ST
r4, 20(fp)
With the capability to perform store operations out-of-order, the processor requires only four cycles using the following schedule:
MUL
r3 = r4, r5
LI
r4 = 0
ST
r4, 20(fp)
ST
r3, 16(fp)
With respect to control-speculative execution of store operations, consider the following code fragment:
CMPLIcr0, r4, 0
BTRUEcr0.eq, label
ST r4, 12(fp)
label:
If we assume that the branch is predicted to not be taken most of the time, and branch resolution requires 3 cycles from a compare operation to a branch operation using the condition, then the above code requires 5 cycles to execute even if the branch is correctly pr
Altman Erik
Ebcioglu Kemal
Gschwind Michael
Sathaye Sumedh
Elmore Stephen
F. Chau & Associates LLP
Kim Matthew
LandOfFree
Methods and apparatus for reordering and renaming memory... does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Methods and apparatus for reordering and renaming memory..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Methods and apparatus for reordering and renaming memory... will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-2980958