Data processing: software development – installation – and managem – Software program development tool – Translation of code
Reexamination Certificate
1998-04-24
2002-01-22
Dam, Tuan Q. (Department: 2122)
Data processing: software development, installation, and managem
Software program development tool
Translation of code
C717S152000, C712S207000
Reexamination Certificate
active
06341370
ABSTRACT:
BACKGROUND OF THE INVENTION
The present invention relates to a method and apparatus for executing instructions in a computer. More specifically, the present invention relates to a method and apparatus for improving the performance of assembly code generated by an optimizing compiler. This is done by integrating data prefetching and modulo scheduling, and by inserting prefetch instructions generated after module scheduling has been performed.
Modern processors employ numerous techniques to provide the performance that today's applications require. One such technique is instruction pipelining. Instruction pipelining is a processing technique whereby multiple instructions are overlapped in execution. This simultaneous execution of instruction is sometimes referred to as instruction-level parallelism. It is a goal of pipelined architectures to maximize instruction-level parallelism.
In a processor having a pipelined architecture, one means of increasing instruction-level parallelism is the use of multiple pipelines, in which instructions are issued using a scheduler or similar hardware construct. Processors employing such constructs are commonly known as superscalar processors. Instructions may be scheduled for issue to the pipelines based on numerous factors, such as pipeline availability, op-code type, operand availability, data dependencies, and other factors. In such architectures, the processor's pipelines must be used efficiently to maximize instruction-level parallelism. This means that the pipelines are fed instructions as quickly as possible, while pipeline stalls are minimized.
To ensure that the various pipelines in superscalar processors are efficiently used, complex compilers have been created to generate code that takes maximum advantage of the of superscalar processors' capabilities. Such compilers orchestrate the issue and execution of instructions to maximize instruction level parallelism and so, throughput. One method employed in such compilers is modulo scheduling.
FIG. 1
illustrates the execution of seven iterations of a pipelined loop as scheduled for execution by a compiler having modulo scheduling capabilities. Modulo scheduling achieves instruction level parallelism by beginning the execution of one or more iterations of a loop before the previous iteration has completed. This is done by issuing instructions in the subsequent iteration(s) to available pipelines within the superscalar processor. A concept fundamental to modulo scheduling is initiating new iterations at fixed intervals. The interval employed is referred to as the initiation interval or iteration interval (II), and is exemplified in
FIG. 1
by an iteration interval
100
.
The scheduled length of a single iteration (trip length, or TL) is divided into stages, each one of a length equal to iteration interval
100
. The number of stages that each iteration requires may be defined as:
SC
=
TL
II
(
1
)
where SC is stage count. The three phases of loop execution (after modulo-scheduling) are shown in
FIG. 1
as a prologue
102
, a kernel
104
, and an epilogue
106
. During prologue
102
and epilogue
106
, not all stages of successive iterations execute. Only during kernel
104
are all stages of the loop being executed. Prologue
102
and epilogue
106
last for (SC-1)*II machine cycles. The number of times the loop is to be iterated is known as the trip count. If the trip count is relatively large, kernel
104
will last much longer than prologue
102
or epilogue
106
. The primary performance metric for a modulo-scheduled loop is the iteration interval. It is a measure of the steady state throughput for loop iterations. Smaller iteration interval values imply higher throughput. Therefore, a modulo scheduler preferably attempts to derive a schedule that minimizes the iteration interval. The time required to execute n iterations is:
T(n)=(n+SC−1)·II (2)
The average time to execute one of these iterations is effectively:
T
⁡
(
n
)
iteration
=
(
n
+
SC
-
1
)
·
II
n
(
3
)
As can be seen from Equation 3, T(n)
iteration
approaches II as n approaches infinity.
The execution of the loop scheduled in
FIG. 1
begins with stage
0
of a first iteration
110
. During the first II machine cycles, no other iteration executes concurrently. This exemplifies iteration interval
100
. This also marks the beginning of prologue
102
. After the first II machine cycles, first iteration
110
enters stage
1
and a second iteration
112
enters stage
0
. New iterations (e.g., a third iteration
114
) join every II machine cycles until a state is reached when all stages of different iterations are executing. This marks the beginning of kernel
104
, and is exemplified by the execution of a fourth iteration
116
, a fifth iteration
118
, a sixth iteration
120
, and a seventh iteration
122
. Toward the end of loop execution, during epilogue
106
, no new iterations are initiated and those that are in various stages of progress gradually complete.
Scheduling in a compiler employing modulo scheduling may proceed in a manner similar to the following. The data dependence graph (DDG), a directed graph, is constructed for the loop being scheduled. The nodes in this graph correspond to instructions, with the arcs corresponding to dependencies between them. Two attributes the arcs possess are latency and the dependence distance (also referred to as omega or “&OHgr;”). Latency is the number of machine cycles of separation required between a source instruction and a sink (or destination) instruction. A source instruction usually provides some or all of the data used by a destination instruction. Omega represents the iteration distance between the two nodes (instructions). In other words, omega represents the number of loop iterations from the source instruction to the destination instruction. For example, data generated by a source instruction in the first iteration may not be needed until the third iteration, equating to an omega of two.
Prior to scheduling, two bounds on the maximum throughput are derived: the minimum iteration interval (MII) and the recurrence minimum iteration interval (RMII). The MII is a bound on the minimum number of machine cycles needed to complete one iteration and is based only on processor resources. For example, if a loop has ten add operations and the processor is capable of executing at most two add operations per machine cycle, then the add unit resource would limit throughput to at most one iteration every five machine cycles. The MII is computed by determining the maximum throughput for each resource in terms of iterations per machine cycle, in turn, and taking the minimum of those maximum throughput values as the processor's maximum guaranteed throughput.
The RMII is a bound on the minimum number of clocks needed to complete one iteration and is based only on dependencies between nodes. DDG cycles imply that a value x
i
computed in some iteration i is used in a future iteration j and is needed to compute a similarly propagated value in iteration j. These circular dependencies place a limit on how rapidly iterations can execute because computing the values needed in the DDG cycle takes time. For each elementary DDG cycle, the ratio of the sum of the latencies (l) to the sum of the omegas (d) is computed. This value limits the iteration throughput because it takes l machine cycles to compute values in a DDG cycle that spans d iterations.
The fixed spacing between overlapped iterations forces a constraint on the scheduler other than the normal constraints imposed by the arcs in the DDG. Note that placing an operation at a time t implies that there exists a corresponding operation in the k
th
future iteration at (t+k*II). Operations that use the same resource must be scheduled such that they are executed at different times, modulo the II, to avoid stalls caused by a resource being in use. This is referred to as the modulo constraint. The modulo constraint implies that if an operation uses a resource at time t
1
and a
Mahadevan Rajagopalan
Tirumalai Partha Pal
Campbell III Samuel G.
Dam Tuan Q.
Nguyen-Ba Hoang-vu Antony
Rifai D'Ann Naylor
Skjerven Morrill & MacPherson LLP
LandOfFree
Integration of data prefetching and modulo scheduling using... does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Integration of data prefetching and modulo scheduling using..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Integration of data prefetching and modulo scheduling using... will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-2844635