Scheduling operations using a dependency matrix

Electrical computers and digital processing systems: processing – Dynamic instruction dependency checking – monitoring or...

Reexamination Certificate

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

C712S219000, C712S217000, C712S240000, C712S245000, C712S207000

Reexamination Certificate

active

06557095

ABSTRACT:

FIELD
The present invention relates to the scheduling of operations in a processor. More particularly, the present invention relates to a method and apparatus for scheduling operations using a dependency matrix.
BACKGROUND
A primary function of a processor is to perform a stream of operations, such as a stream of computer instructions. Some processors are designed to completely perform one operation in the stream before beginning to perform the next operation. With these “in-order” processors, the result of one operation is correctly used by later operations that “depend” on it. Consider the following instructions:
Load memory-
1
→register-X
Add register-X register-Y→register-Z.
The first instruction loads the content of memory-
1
into register-X. The second instruction adds the content of register-X to the content of register-Y and stores the result in register-Z. The second instruction is a “child” operation that depends on the first instruction, or “parent” operation. If the result of the first instruction is not stored in register-X before the second instruction is executed, an incorrect result will be stored in register-Z. Note that a single operation may have more than one parent, more than one child, and may be both a parent and a child with respect to different operations.
To improve a processor's performance, operations can be performed “out-of-order.” For example, if data for one instruction in a stream is not ready at a particular time, the processor may execute another instruction later in the stream. In this case, a “scheduler” can schedule instructions so that a child instruction will not be performed before its parent instruction. This improves processor performance because the processor does not remain idle until the first instruction's data is ready.
Computer instructions are not the only operations that have such dependencies. For example, memory operations may be scheduled so that information is stored into a memory location before information is read from that memory location by a later operation. Other examples include scheduling operations based on limited execution resources, memory resources, register resources, slot availability or bus availability. By way of example, the scheduling of micro-operations, also known as “&mgr;ops” or “uops,” will be used herein to describe known scheduling techniques.
FIG. 1
is an overview of a known system for processing instructions and uops. The system includes an instruction fetch and decode engine
110
that decodes an instruction stream into a series of in-order ops that represent the data flow of the instruction stream. The instructions can be decoded, for example, into uops with two logical sources and one logical destination. The uops are “issued” from the instruction fetch and decode engine
110
to a renaming and allocation unit
120
. If a processor has only a limited number of physical registers, the renaming and allocation unit
120
maps logical register references to physical register references.
The uops are then sent to a scheduler
130
, which stores several pending uops and selects from this group, or “queue,” the uop or uops that will be performed next. The scheduler
130
selects uops such that a child uop will not be performed before its parent uop. That is, the scheduler
130
decides if every source register used by a uop is ready to be used. If all of the uop's sources are ready, and if execution resources are available, the uop is sent, or “dispatched,” to a execution resource
140
where the operation is performed. Thus, uops are dispatched based on data flow constraints and resource availability, not the original ordering of the stream.
Known schedulers are typically based on the “Tomasulo” scheduler.
FIG. 2
, a block diagram of such a Tomasulo scheduler, shows two issued uops, Add
1
and Add
2
, that have been received by a scheduler
200
. Each uop has two sources and a destination. Add
1
sums the contents of register
1
(r
1
) with the contents of r
2
. The result is stored in r
3
. Add
2
-sums the contents of r
3
with the contents of r
2
and stores the result in r
4
. As can be seen, Add
2
depends on, and is the child of, Add
1
. The scheduler
200
includes a ten-bit scoreboard
210
that is used to keep track of which registers are ready. Each bit represents a register, and, for example, a “0” indicates that the register is not ready while a “1” indicates that the register is ready. If Add
1
has not been executed, the bit associated with r
3
in the scoreboard
210
is set to “0,” indicating that r
3
is not ready.
An active scheduler
220
uses the scoreboard
210
to determine if a uop is ready for dispatch. For example, the active
220
scheduler looks at the bits associated with r
3
and r
2
when considering Add
2
. If the scoreboard
210
reflects that both sources are ready, the active scheduler
220
dispatches the uop for execution. If either source is not available, the uop is not dispatched. After the uop is executed, the scoreboard
210
is updated to reflect that
4
is now ready.
FIG. 3
illustrates circuitry associated with a Tomasulo scheduler. When a uop is written, or allocated, into the Tomasulo scheduler, its sources are read from the scoreboard
210
. If the scoreboard
210
indicates that the sources are ready, the uop is ready to schedule. Sources that are ready in the scoreboard
210
are marked ready in the scheduler. Sources that are not ready will monitor the result bus. The value of a pending uop's source register
310
is matched against the value of completed uops on the destination, or result, bus using a group of compares
320
. The outputs from the group of compares
320
are input to a wide OR
330
, and the output of the wide OR is stored as a ready bit
340
for the first source. Similar logic (not shown in
FIG. 3
) is performed to generate a ready bit for the pending uop's second source. When all of the pending uop's sources are ready, as determined by the output of the logic gate
350
, the uop is ready for dispatch. This logic is repeated for each pending uop, such as entries 1 to n. If multiple uops are ready to dispatch, priority logic
360
determines which uop will be dispatched. A lookup is performed to determine the destination register
370
of the dispatching uop, and this value is driven on a result bus.
The Tomasulo scheduler uses a “tight” scheduling loop as shown in FIG.
4
. For each pending uop, the scheduler monitors the result bus and compares the destination of executed uops with the pending uop's sources at
410
. Next, the scheduler performs ready determination logic
420
to determine the dispatch readiness of the pending uop. For every source used by the pending uop, the results of the comparison performed at
410
are ORed at
430
. The results for each source are then ANDed at
440
. Only if every source is ready does the scheduler determine that the uop is ready for dispatch.
Several uops may be ready for dispatch at one time. If more than one uop is ready, prioritization is performed at
450
to determine which of the ready uops should be dispatched first. Finally, the pending uop is dispatched at
460
. When a uop is dispatched, the scheduler repeats the actions described above, resulting in the tight scheduling loop that determines when pending uops are ready for execution.
There are a number of disadvantages, however, to known scheduling techniques. For example, the basic motivation for increasing clock frequencies is to reduce instruction latency. Suppose that a part of a program contains a sequence of N instructions, I
1
, I
2
, . . . , I
N
. This part of the program may also contain any other instructions. Suppose also that each instruction requires, as an input, the result of the previous instruction. Such a program cannot be executed in less time than T=L
1
+L
2
+ . . . L
N
, where L
n
is the latency of instruction I
n
, even if the processor was capable of executing a very large number of instructions in parallel. Hence, the only way to execut

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

Scheduling operations using a dependency matrix does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with Scheduling operations using a dependency matrix, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Scheduling operations using a dependency matrix will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-3066096

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.