Method for reducing number of register file ports in a wide...

Electrical computers and digital processing systems: processing – Processing architecture – Superscalar

Reexamination Certificate

Details

C712S028000, C712S017000, C712S018000, C712S217000, C712S218000

Reexamination Certificate

active

06263416

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention generally relates to superscalar processors, and more particularly, to a method for reducing the number of register file ports in a very wide instruction issue processor.
2. Description of the Related Art
To gain performance, current machine architectures have become aggressive in issuing and executing multiple instructions per clock. As explained in further detail below, this almost linearly increases the number of read and write ports to the architectural register file of the chip. Moreover, speculative execution is a common technique employed in implementing such machines, which in turn requires the provision of an additional reorder buffer register file. Thus, when instructions are issued in such machines to execution units, the number of ports is very high on both the architectural register file and the reorder buffer register file. This makes the register files heavily metal limited, resulting in the dual drawbacks of increasing the metal area and worsening the timing characteristics. Rapidly accessing the operand data is critical in most of these machines, and thus the register file timing becomes a performance bottleneck.
Referring to FIG. 1, the basic relationship between the number of register ports and the issue number of a machine will be described. FIG. 1 generally depicts the operation of a superscalar machine. Reference characters IF denote fetching an instruction (such as “add” r1 and r2 to obtain r3), and characters ID denote the fetching of data needed to carry out the instruction (such as r1 and r2). The instructions and data are loaded in a register file, whereby the data is applied to the appropriate one of parallel execution units (such as ALUs). In the case of two execution units running in parallel, the processor is said to have a superscalar degree of 2, or in other words, is a 2-issue machine. Four data (two for each issue) are simultaneously supplied from the register file to the execution units, and thus four register read ports would be needed. Similarly, in the case of a 4-issue machine, the register file would be equipped with eight read ports, whereas an 8-issue machine would require sixteen register read ports. Also, in some cases the execution units, such as store execution units, will require the provision of three ports.
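The port arithmetic described above can be sketched in a few lines. This is only an illustration of the stated relationship, not part of the described design; the function name `read_ports` and its parameter names are assumptions, and the default of two source operands per instruction is chosen to match the 2-, 4-, and 8-issue examples in the text.

```python
def read_ports(issue_width, operands_per_instruction=2):
    """Read ports needed if every issued instruction fetches all of its
    source operands from the register file in the same cycle.
    (Hypothetical helper, named here for illustration only.)"""
    return issue_width * operands_per_instruction


# The three cases named in the text: 2-issue, 4-issue, and 8-issue machines.
for width in (2, 4, 8):
    print(f"{width}-issue: {read_ports(width)} read ports")

# A single unit that needs three operands, such as the store units mentioned above.
print("three-operand unit:", read_ports(1, operands_per_instruction=3), "read ports")
```

The linear growth is the point: doubling the issue width doubles the read ports, before any reorder buffer ports are counted.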
In addition, due to dependencies among instructions and a lack of parallelism in the program code, reorder buffers as mentioned above are provided, further increasing the port requirements. Assume, for example, the case of a 4-issue machine in which the four instructions shown in FIG. 2 have been fetched for execution. As can be seen, the third instruction “2” is dependent on the execution results of the first instruction “0”. That is, the value of r1 needed for r4←r18, r1 will not be known until after execution of r1←r2, r3. Thus, if these four instructions were simultaneously applied to the machine's execution pipeline, erroneous calculations may result. Instruction dependencies such as this were one factor leading to the so-called “out of order” execution discussed below.
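The dependency just described is a read-after-write hazard, and a minimal sketch of detecting it follows. Only instructions “0” and “2” are taken from the text; instructions “1” and “3” of FIG. 2 are not reproduced here, so the ones below are hypothetical placeholders, as are the names `Instr` and `raw_dependencies`.

```python
from collections import namedtuple

# One destination register and a tuple of source registers per instruction.
Instr = namedtuple("Instr", ["dest", "srcs"])

program = [
    Instr("r1", ("r2", "r3")),    # 0: r1 <- r2, r3   (from the text)
    Instr("r5", ("r6", "r7")),    # 1: hypothetical placeholder
    Instr("r4", ("r18", "r1")),   # 2: r4 <- r18, r1  (from the text)
    Instr("r8", ("r9", "r10")),   # 3: hypothetical placeholder
]


def raw_dependencies(instrs):
    """Return (consumer, producer) index pairs where the consumer reads a
    register last written by an earlier instruction in the same group."""
    last_writer = {}
    deps = []
    for i, ins in enumerate(instrs):
        for src in ins.srcs:
            if src in last_writer:
                deps.append((i, last_writer[src]))
        last_writer[ins.dest] = i
    return deps


print(raw_dependencies(program))  # [(2, 0)]: instruction 2 must wait for instruction 0
```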
Reference is now made to FIG. 3 for a general explanation of an “out-of-order” machine. The out-of-order machine is capable of scanning the fetched instructions to identify those that are dependent and those that are independent. Consider the example of an 8-issue machine, and assume, as shown in FIG. 3, three sets of eight instructions each, for a total of 24 instructions under consideration. As also shown, assume the second and sixth instructions of the first set are dependent, and that there are no dependent instructions in either the second or third sets. These instructions are loaded into an issue window or instruction window of the machine. A scheduling algorithm identifies the independent instructions within the instruction window whose operands have been completed (and for which an execution module is available), and loads the first eight of the independent instructions in the instruction pipeline. These would be instructions 1, 3-5 and 7-10 in FIG. 3. Then, assuming that the operands for instructions 2 and 6 have been resolved, these instructions together with instructions 11-16 may be applied to the pipeline in a next execution cycle.
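The selection of the first eight ready instructions per cycle can be sketched as follows. This is a simplified model under stated assumptions, not the scheduling algorithm of any particular machine: the function `schedule` and the `ready_cycle` map (the earliest cycle in which an instruction's operands are available) are hypothetical, and execution-unit availability is ignored.

```python
def schedule(window, ready_cycle, issue_width=8):
    """Issue up to issue_width ready instructions per cycle, oldest first.

    window:      instruction numbers in program order.
    ready_cycle: earliest cycle in which a dependent instruction's operands
                 are available; instructions not listed are ready at cycle 0.
    Returns one list of issued instruction numbers per cycle.
    """
    pending = list(window)
    cycles = []
    cycle = 0
    while pending:
        ready = [i for i in pending if ready_cycle.get(i, 0) <= cycle]
        issued = ready[:issue_width]
        cycles.append(issued)
        pending = [i for i in pending if i not in issued]
        cycle += 1
    return cycles


# 24 instructions; instructions 2 and 6 are assumed to resolve after the first cycle.
cycles = schedule(range(1, 25), ready_cycle={2: 1, 6: 1})
for n, issued in enumerate(cycles):
    print(f"cycle {n}: {issued}")
```

Under these assumptions the first cycle issues instructions 1, 3-5 and 7-10, and the second issues 2 and 6 together with 11-16, matching the walkthrough above.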
Conventionally, out-of-order execution for an 8-issue machine is implemented as shown in FIG. 4. Eight instructions are received in order. Each instruction is made up of an instruction identifier Iid, a logical destination address Lid and at least two operand identifiers ser. The logical destination addresses Lid identify the register of an architectural register file ARF 408 in which a corresponding instruction result is to be deposited, and are stored in order in a dependency chain table DCT 402 at corresponding instruction identifier addresses Iid of the DCT 402.
As already mentioned, the instructions arrive eight at a time in an order dictated by the program code. These instructions are stored, in order, in eight of the one-hundred twenty-eight registers of the central instruction window CIW 404. By searching the destination addresses Did contained in the DCT 402, a scheduling algorithm identifies the dependent instructions within the CIW 404 whose operands have not been completed. Only the first eight independent entries are applied to a bypass matrix 410. The bypass matrix 410 receives the operand data from multiple sources including, but not limited to, the ARF 408 and/or a reorder buffer ROB 406, and routes the data to the respectively appropriate execution units 412. The execution units 412, for example, are arithmetic logic units and the like.
The reorder buffer ROB 406 temporarily stores the results of the execution units 412, and for this reason, the ROB 406 is equipped with eight write ports. Each result is stored in the ROB 406 at an address which corresponds to the physical register identifier Rid, which is the transformed logical destination address Lid once it passes through the DCT 402. These results remain in the ROB 406 until they are “retired” to the architectural register file 408, at which time the data is stored at the appropriate logical destination address Lid within the ARF 408. In this example, the ARF 408 has 160 registers.
In the example, up to eight data at a time can be retired into the ARF 408 from the ROB 406, and thus the ROB 406 is equipped with eight read ports and the ARF 408 is equipped with eight write ports. However, all eight data must satisfy the retirement criteria, and thus, in some cases fewer than eight data may be retired in a given cycle. For a datum to be retired, all previous data must be present. In other words, there can be no retirement of the results of a given instruction into the ARF 408 until all prior instructions have been executed and stored.
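The in-order retirement rule above can be sketched as follows. This is a minimal model rather than the described hardware: the reorder buffer is assumed to be a Python list kept in program order (oldest first), and the field names `lid`, `value`, and `done` as well as the function `retire` are hypothetical.

```python
def retire(rob, arf, retire_width=8):
    """Move completed results from the head of the reorder buffer into the
    architectural register file, stopping at the first entry whose result is
    not yet present or once retire_width entries have been retired."""
    retired = 0
    while rob and rob[0]["done"] and retired < retire_width:
        entry = rob.pop(0)
        arf[entry["lid"]] = entry["value"]
        retired += 1
    return retired


# The third entry has not finished, so the completed fourth entry must wait behind it.
rob = [
    {"lid": "r1", "value": 10, "done": True},
    {"lid": "r4", "value": 20, "done": True},
    {"lid": "r5", "value": None, "done": False},
    {"lid": "r8", "value": 40, "done": True},
]
arf = {}
count = retire(rob, arf)
print(count, arf)  # 2 {'r1': 10, 'r4': 20}
```

This shows why fewer than eight data may be retired in a given cycle even when more than eight results are waiting.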
In addition, the occurrence of a so-called “trap” results in the “flushing” of all subsequent data already stored in the ROB 406. Traps are internal errors or exceptions, such as divide-by-zero and arithmetic overflows. Keeping in mind that the instructions are executed out-of-order (relative to the program code), it is possible for a trap to occur after later-ordered instructions have been executed and the corresponding results stored in the ROB 406. A trap results in the deletion of all subsequent data of the ROB 406. In this way, the integrity of the data contained in the ARF 408 is assured.
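A flush on a trap can be sketched as a filter over the reorder buffer. The sketch assumes each entry carries a program-order number (the hypothetical field `iid`) and that the trapping instruction itself leaves no result behind; the text does not specify either detail, and the function name `flush_after_trap` is likewise an assumption.

```python
def flush_after_trap(rob, trap_iid):
    """Keep only results of instructions that precede the trapping
    instruction in program order; discard everything else."""
    return [entry for entry in rob if entry["iid"] < trap_iid]


# Results arrived out of order; instruction 4 then raises a trap.
rob = [
    {"iid": 3, "value": 7},
    {"iid": 6, "value": 11},
    {"iid": 5, "value": 2},
    {"iid": 9, "value": 30},
]
rob = flush_after_trap(rob, trap_iid=4)
print(rob)  # [{'iid': 3, 'value': 7}]
```

Because speculative results sit only in the reorder buffer until retirement, discarding them leaves the architectural register file untouched.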
The configuration of FIG. 4 also demands the provision of read ports from both the ROB 406 and the ARF 408. This is because the possibility exists that some or even all of the data needed to execute the eight instructions is contained in one of the ROB 406 or the ARF 408. In the example here, eighteen read ports extend from the ROB 406 to the bypass matrix 410 and an additional eighteen read ports extend from the ARF 408 to the bypass matrix 410. The number of ports (eighteen in this example) is

