Electrical computers and digital processing systems: processing – Processing architecture – Distributed processing system
Reexamination Certificate
1999-03-22
2001-07-31
Pan, Daniel H. (Department: 2183)
C712S029000, C712S013000, C712S014000, C711S153000, C711S147000
active
06269437
ABSTRACT:
FIELD OF THE INVENTION
The present invention relates generally to microprocessors and other types of digital data processors, and more particularly to processors which utilize a clustered organization, i.e., an organization in which groups of execution units are each associated with a designated portion of a register file.
BACKGROUND OF THE INVENTION
A significant problem with wide-issue load-store microprocessors is port pressure on the register file, i.e., the register file must support a large number of simultaneous accesses, and therefore the register file must have many ports. A fully-connected processor organization has execution units which each have full access to the entire register file. Predicate registers and lock files for both registers and predicates also require a correspondingly large number of ports. Since the number of ports can adversely impact the area, cost and maximum clock speed of the processor, it is generally desirable to keep the number of ports under some small number, such as 16 or 32. Execution units and register files may therefore be “clustered” in order to reduce the number of ports required for all simultaneously-utilized execution units.
A clustered organization, in contrast to a fully-connected organization, has groups, i.e., “clusters,” of execution units, each with a portion of the register file. The portion of the register file associated with a given cluster may be referred to as “local” registers. The execution units in a given cluster have full access to the local registers, but limited access to the registers of other clusters. In a clustered organization, the degree of access one cluster has to the others' register files and the interconnection between clusters must be specified. The purpose of clustering is to reduce the register file port pressure. However, the need for some execution units to have global register file access keeps the typical cluster implementation from being truly scalable. In particular, load, store, and branch units, if shared between clusters, generally need global register file access. Register file ports can be shared among units requiring access to them. In this case, techniques for arbitrating among them, and for stalling a unit which is not allowed to use a port it has requested, generally must be provided.
Each type of execution unit in a processor needs a certain number of register file ports to support its operation. With the use of a technique such as virtual single-cycle execution, as described in U.S. patent application Ser. No. 09/080,787 filed May 18, 1998 and entitled “Virtual Single-Cycle Execution in Pipelined Processors,” each unit also requires a certain number of ports on a file of lock registers, which is a logically separate entity. With predicated execution based on architecturally separate predicate registers, a certain number of ports are also required on the predicate file and the predicate lock file.
FIG. 1 summarizes the port requirements for the following types of conventional execution units: branch units, store units, load units, memory units and arithmetic logic units (ALUs). The instructions associated with each of these types of execution units will be described below. Branch units process conditional branch instructions of the form
[(p)] branch to r_x if r_y ∘ r_z,
where register r_x contains an instruction address, and registers r_y and r_z contain the values to be compared using the operator ∘ (representing operators such as =, <, >, etc.). The branch instruction requires reads of r_x, r_y and r_z, reads of the locks on r_x, r_y and r_z, and a read of predicate p and the lock on predicate p.
Store units process store instructions of the form
[(p)] mem[r_x + r_y] ← r_z.
The store instruction requires reads of r_x, r_y and r_z, reads of the locks on r_x, r_y and r_z, and a read of predicate p and the lock on predicate p. It is assumed for this example that predicate values are never individually stored in memory; for spilling and context switches, a block store instruction should be provided, which would not be executed in parallel with other instructions.
Load units process load instructions of the form
[(p)] r_x ← mem[r_y + r_z].
The load instruction requires reads of r_y and r_z, and a write of r_x. It requires reads of the locks on r_x, r_y, and r_z, and two writes of the lock on r_x, i.e., once to lock it, and once to unlock it. It also requires the read of predicate p and the lock on predicate p. It is assumed for this example that predicate values are never individually loaded from memory; for filling and context switches, a block load instruction should be provided, which would not be executed in parallel with other instructions.
A memory unit can perform either a load or a store on each cycle. Therefore, it has the combined port requirements of a load and store unit. It may seem that the memory unit requires only three total register ports, since it cannot perform both a load and a store simultaneously. However, in a pipelined memory unit, a load followed by a number of stores will require four simultaneous register accesses during the load writeback. Conversely, a store followed by a load will use only two ports when the load is at register read. The average number of ports is three, but the peak is four.
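The peak of four register ports can be checked with a small cycle-by-cycle count. The following is only a sketch under stated assumptions: one instruction issues per cycle, a load writes back a fixed number of cycles after its register read, and the names WRITEBACK_DELAY and ports_per_cycle are hypothetical, not taken from the patent.

```python
# Sketch: register-port usage per cycle in a pipelined memory unit.
# Assumption: one instruction issues per cycle, and a load writes back
# WRITEBACK_DELAY cycles after its register read (the value is illustrative).
WRITEBACK_DELAY = 2

READS = {"load": 2, "store": 3}   # register reads at the register-read stage
WRITES = {"load": 1, "store": 0}  # register writes at the writeback stage


def ports_per_cycle(program):
    """Return the number of register ports in use on each cycle."""
    usage = [0] * (len(program) + WRITEBACK_DELAY)
    for issue_cycle, op in enumerate(program):
        usage[issue_cycle] += READS[op]
        usage[issue_cycle + WRITEBACK_DELAY] += WRITES[op]
    return usage


# A load followed by stores: the load writeback overlaps a store's three reads.
load_then_stores = ports_per_cycle(["load", "store", "store", "store"])
# A store followed by a load: only the load's two reads occur at its register read.
store_then_load = ports_per_cycle(["store", "load"])

print(load_then_stores)
print(store_then_load)
```

In this model the busiest cycle of the load-then-stores sequence uses four ports, while the load's register-read cycle in the store-then-load sequence uses two, matching the counts given above.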
Instructions processed by the ALU may be of the form
[(p)] r_x ← r_y ∘ r_z,
where operator ∘ represents &, +, etc., and predicate p, if provided, indicates whether the instruction's results should be written back or annulled. These instructions require reads of registers r_y and r_z and a write of register r_x. They require reads of the locks on r_x, r_y, and r_z, and two writes of the lock on r_x, i.e., one to lock the register at register read, and one to unlock the register at register writeback. Two write ports are required on the lock file for any unit which writes to a register. Even though the first write to the lock (at register read) and the second (at register writeback) are displaced in time, in order to be able to issue an instruction to the unit on every cycle, two write ports must be dedicated to it; if only one is given, the first write for a later instruction and the second write for an earlier instruction will contend for it.
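The contention argument for the lock file can be illustrated by counting lock writes per cycle under back-to-back issue. This is a minimal sketch, assuming a fixed latency between register read and register writeback; LATENCY and lock_writes_per_cycle are hypothetical names chosen for illustration.

```python
# Sketch: lock-file write-port demand for a unit that writes a register.
# Assumption: an instruction locks its destination at register read (cycle t)
# and unlocks it at writeback (cycle t + LATENCY); LATENCY is illustrative.
LATENCY = 3


def lock_writes_per_cycle(n_instructions, latency=LATENCY):
    """Count lock-file writes per cycle when one instruction issues every cycle."""
    writes = [0] * (n_instructions + latency)
    for t in range(n_instructions):
        writes[t] += 1            # lock r_x at register read
        writes[t + latency] += 1  # unlock r_x at register writeback
    return writes


demand = lock_writes_per_cycle(6)
print(demand)
```

Once the pipeline fills, every cycle carries one lock write from a newly issued instruction and one unlock write from an earlier instruction, so a single lock write port would force a stall.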
The ALUs may also perform a predicate move instruction, having the form
[(p)] p_y ← p_z.
To support this form of an ALU instruction, each ALU requires two predicate read ports, one predicate write port, three predicate lock read ports and two predicate lock write ports. Another form of ALU instruction sets or clears a predicate, based on a comparison between registers, and may have the following form
[(p_x)] set p_y if r_y ∘ r_z
or
[(p_x)] clear p_y if r_y ∘ r_z,
where the operator ∘ represents =, <, etc. The number of ports already provided above will support this form of ALU instruction.
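The per-unit port counts stated in the text above can be collected into a table and summed for a given mix of units. The following is a sketch only, not a reproduction of FIG. 1: the Ports type and the total_ports helper are hypothetical, and the memory-unit row assumes the element-wise maximum of the load and store rows, which gives the peak of four register ports described earlier.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class Ports:
    reg_read: int = 0
    reg_write: int = 0
    lock_read: int = 0
    lock_write: int = 0
    pred_read: int = 0
    pred_write: int = 0
    pred_lock_read: int = 0
    pred_lock_write: int = 0

    def __add__(self, other):
        return Ports(*(x + y for x, y in zip(astuple(self), astuple(other))))

    def scaled(self, n):
        return Ports(*(n * x for x in astuple(self)))


# Per-unit requirements as stated in the text.
BRANCH = Ports(reg_read=3, lock_read=3, pred_read=1, pred_lock_read=1)
STORE = Ports(reg_read=3, lock_read=3, pred_read=1, pred_lock_read=1)
LOAD = Ports(reg_read=2, reg_write=1, lock_read=3, lock_write=2,
             pred_read=1, pred_lock_read=1)
ALU = Ports(reg_read=2, reg_write=1, lock_read=3, lock_write=2,
            pred_read=2, pred_write=1, pred_lock_read=3, pred_lock_write=2)
# Assumption: a memory unit needs the element-wise maximum of LOAD and STORE.
MEMORY = Ports(*(max(x, y) for x, y in zip(astuple(LOAD), astuple(STORE))))


def total_ports(branch=0, load=0, store=0, memory=0, alu=0):
    """Sum the port requirements when every unit has full register file access."""
    total = Ports()
    for unit, count in ((BRANCH, branch), (LOAD, load), (STORE, store),
                        (MEMORY, memory), (ALU, alu)):
        total = total + unit.scaled(count)
    return total


# Illustrative mix: one branch unit, one memory unit and four ALUs.
example = total_ports(branch=1, memory=1, alu=4)
print(example)
```

Under these assumptions, the illustrative mix needs 14 register read ports and 5 register write ports, i.e., 19 register file ports in total, which already exceeds a budget of 16.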
FIG. 2 shows the fully-connected port requirements for exemplary organizations O1 and O2, and a more general processor organization. Organization O1 has one branch unit, one memory unit, and four ALUs. O2 has two branch units, four memory units, and 32 ALUs. The general processor organization has b branch units, l load units, s store units, m memory units, and a ALUs. As noted previously, in a clustered organization, the register files and the set of execution units are partitioned into partially connected groups: each execution unit has full access to the register files in its local cluster, but limited access to the register files in any other cluster; the degree of access and the method of communication between clusters must be specified. A clustered organization with c clusters and e execution units in each cluster has a=ce total execution units in the clusters. An unclustered organization of the same size could be described either as having ce units in one cluster or as having c fully-connected clusters with e execut
Batten Dean
D'Arcy Paul Gerard
Glossner C. John
Jinturkar Sanjay
Wires Kent E.
Agere Systems Guardian Corp.
Pan Daniel H.
Ryan & Mason & Lewis, LLP