Electrical computers and digital processing systems: processing – Processing architecture – Superscalar
Reexamination Certificate
2000-08-18
2004-06-29
Tsai, Henry W. H. (Department: 2183)
C712S028000, C712S034000
active
06757807
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
This invention pertains generally to processor architecture, focusing on the execution units. More particularly, this invention is directed to an improved processor using clustered groups of execution units visible at the macro-architecture level, facilitating improved parallelism and backwards compatibility in a processor instruction set.
2. The Prior Art
As reliance on computer systems has increased, so have demands on system performance. This has been particularly noticeable in the past decade as both businesses and individual users have demanded far more than the simple character cell output on dumb terminals driven by simple, non-graphical applications typically used in the past. Coupled with more sophisticated applications and internet use, the demands on the system, and in particular on the main processor, are increasing at a very high rate.
As is well known in the art, a processor is used in a computer system, where the computer system as a whole is of conventional design using well known components. An example of a typical computer system is the Sun Microsystems Ultra 10 Model 333 Workstation running the Solaris v.7 operating system. Technical details of the example system may be found on Sun Microsystems' website.
A typical processor is shown in block diagram form in FIG. 1. Processor 100 contains a Prefetch And Dispatch Unit 122 which fetches and decodes instructions from main memory (not shown) through Memory Management Unit 110, Memory Interface Unit 118, and System Interconnect 120. In some cases, the instructions or their operands may be in non-local cache, in which case Prefetch And Dispatch Unit 122 uses External Cache Unit 114 to access external cache RAM 116. Instructions that are decoded and waiting for execution may be stored in Instruction Cache And Buffer 124. Prefetch And Dispatch Unit 122 detects which type of instruction it has, and sends integer instructions to Integer Execution Unit 126 and floating point instructions to Floating Point Execution Unit 128. The instructions sent by Prefetch And Dispatch Unit 122 to Integer Execution Unit 126 contain register addresses, typically two read locations and one write location, where the read locations are the values to be operated on and the write location is where the result will be stored.
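The dispatch step described above can be illustrated with a minimal sketch. It is not taken from the patent: the `Instruction` fields, the opcode mnemonics, and the unit names are hypothetical, chosen only to show an instruction carrying two read addresses and one write address being routed by type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instruction:
    opcode: str  # hypothetical mnemonic, e.g. "add" or "fadd"
    src1: int    # first read register address
    src2: int    # second read register address
    dest: int    # write register address (where the result is stored)


# Assumption: a fixed set of mnemonics marks floating point instructions.
# A real decoder would inspect encoded opcode bits instead.
FLOATING_POINT_OPCODES = {"fadd", "fsub", "fmul", "fdiv"}


def dispatch(instr: Instruction) -> str:
    """Return the name of the execution unit that should receive instr."""
    if instr.opcode in FLOATING_POINT_OPCODES:
        return "floating_point_execution_unit"
    return "integer_execution_unit"
```

For example, `dispatch(Instruction("fmul", 1, 2, 3))` selects the floating point unit, while an `"add"` instruction goes to the integer unit.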
FIG. 1 has one integer and one floating point execution unit. To improve performance, parallel execution units were added. One parallel execution unit implementation is shown in FIG. 2. To avoid the confusion and surplus verbiage caused by the inclusion of non-relevant portions of the processor, FIG. 2 and subsequent drawings show only the relevant portions of a processor. As will be appreciated by one of ordinary skill in the art, the portion of a processor shown is functionally integrated into the rest of a processor.
Integer Register File 200 is used by Integer Execution Units 208 and 210, as well as any other integer execution units that could be connected. Floating Point Register File 202 is used by Floating Point Execution Units 212 and 214, as well as any other floating point execution units that could be connected. Also shown are Bypass Circuits 204 and 206. Bypass circuits are needed because one execution unit can attempt both a read and a write to a particular register, or one execution unit may be reading a register in its corresponding register file while another is trying to write to the same register. Depending on the exact timing of the signals as they arrive over the data lines from one or both execution units, this can lead to indeterminate results. Bypass Circuits 204 and 206 detect this condition and arbitrate access. The correct value is sent to the execution unit executing a read, and the correct new value is written into the register. The circuitry needed to do this is complex for more than one execution unit.
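The arbitration performed by a bypass circuit can be sketched in software under one stated assumption: when a read and a write target the same register in the same cycle, the reader is forwarded the value being written. The text does not specify which value counts as "correct", so this forwarding policy, the function name `run_cycle`, and its arguments are all hypothetical.

```python
def run_cycle(registers, reads, writes):
    """Simulate one cycle of register file access with bypassing.

    registers: dict mapping register address -> stored value (updated in place)
    reads:     list of register addresses read during this cycle
    writes:    dict mapping register address -> value written during this cycle
    Returns the list of values delivered to the readers, in order.
    """
    delivered = []
    for reg in reads:
        if reg in writes:
            # Conflict detected: forward the in-flight value (assumed policy).
            delivered.append(writes[reg])
        else:
            delivered.append(registers[reg])
    # The new values are committed to the register file.
    registers.update(writes)
    return delivered
```

With `registers = {1: 10, 2: 20}`, calling `run_cycle(registers, [1, 2], {2: 99})` delivers `[10, 99]` and leaves register 2 holding 99. In hardware, each read port must be compared against each write port, which hints at why the cost grows quickly with the number of ports.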
Additional execution units need additional register ports to read and write the register files. The complexity of the bypass circuitry rises as the square of the number of register ports attached; for $n$ register ports on a register file the complexity of the bypass circuitry rises as $n^2$. Thus, having too many execution units attached to a register file will slow performance due to the additional complexity of the register file's support circuitry.
Referring now to complexity in general, complexity is an abstract metric of the cost of implementing a given mechanism or feature. Complexity translates most directly into the size of the needed circuits. Higher complexity also correlates with higher latency in the circuitry for most circuits, and higher latency means decreased performance. This means it is generally critical to keep complexity to a minimum; otherwise performance begins to decrease which almost always defeats the purpose of the added circuitry.
In addition to the complexity associated with the number of attached execution units and bypass circuitry, a primary bottleneck on the size of register files is the number of ports that must be made available to read and write the registers. The complexity associated with the number of ports is proportional to the square of the total number of ports on a register file. Since there are typically two read operations for every write operation (i.e., most instructions read two values from a register file and write a resulting value), register files typically have two read ports for every write port. If a register file has 8 read ports and 4 write ports, its relative order of complexity would be on the order of $(8+4)^2 = 144$ with 12 ports, when compared to other register files with other numbers of ports. Using the same register file but trying to increase its throughput by increasing the number of read ports by 4 and the number of write ports by 2 yields a relative order of complexity of $(12+6)^2 = 324$ with 18 ports. As an alternative, adding a duplicate of the original register file yields a relative order of complexity of $(8+4)^2 + (8+4)^2 = 288$ with 24 ports. Thus, using more register files with fewer ports per register file adds less complexity with more ports (for more throughput) than trying to increase the number of ports on a single register file.
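The port arithmetic in the preceding paragraph can be checked with a short sketch. It assumes only the stated rule that each register file contributes the square of its total port count; the function name `relative_complexity` is hypothetical.

```python
def relative_complexity(ports_per_file):
    """Sum of squared port counts, one entry per register file."""
    return sum(ports ** 2 for ports in ports_per_file)


single = relative_complexity([8 + 4])             # one file, 12 ports: 144
widened = relative_complexity([12 + 6])           # one file, 18 ports: 324
duplicated = relative_complexity([8 + 4, 8 + 4])  # two files, 24 ports: 144 + 144 = 288
```

Under this rule the duplicated configuration offers 24 ports at a lower relative complexity than the single widened file offers 18.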
The desirable goal of making more registers visible to the programmer and/or compiler is also difficult. In addition to other complexity considerations, the complexity of any register file grows linearly as the number of visible registers grows. To address additional visible registers, more bits in each instruction are needed. This is often not possible given the limited encoding space (field size) of existing instruction set architectures, or is prohibitively expensive in terms of complexity and cost for new instruction sets.
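The pressure that extra visible registers put on instruction encoding can be made concrete with a small sketch. It assumes binary-encoded register addresses and three register fields per instruction (two reads and one write, as described earlier); both function names are hypothetical.

```python
def register_field_bits(visible_registers: int) -> int:
    """Bits needed to address one of `visible_registers` registers."""
    if visible_registers < 1:
        raise ValueError("need at least one register")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 2.
    return max(1, (visible_registers - 1).bit_length())


def operand_bits(visible_registers: int, register_fields: int = 3) -> int:
    """Total instruction bits spent on register addresses."""
    return register_fields * register_field_bits(visible_registers)
```

Under these assumptions, 32 visible registers cost 15 bits per instruction, 64 cost 18, and 128 cost 21, so each doubling takes three more bits out of a fixed-width encoding.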
A new architecture was introduced to address some of the complexity issues associated with the need for increased throughput of the register files. It is based on the principle that many ports can be physically implemented with multiple smaller register files. Each smaller register file has the same number of total write ports the single register file implementation would have, but a smaller number of read ports. When an implementation uses more than one physical register file, all the register files that take the place of the single register file are copies of one another. Since the register files are all copies of one another, a write to any one location in one register file is actually performed as a parallel write to all the small register files. Thus, the number of write ports would stay roughly the same when compared to a large register file. However, the number of read ports may be reduced, as only local execution units would read from a given register file rather than all the execution units. This reduces the number of reads going through any given register file, requiring fewer read ports per register file when compared to a single large register file. This is an
Chuang Chiao-Mei
Jacobson Quinn A.
Martine & Penilla LLP
Sun Microsystems Inc.
Tsai Henry W. H.