Title: Method to reduce memory latencies by performing two levels...
Type: Reexamination Certificate
Filed: 2000-02-07
Issued: 2002-12-17
Examiner: Elmore, Reba I. (Department: 2187)
Class: Electrical computers and digital processing systems: memory – Address formation – Address mapping
Other Classes: C711S120000, C711S167000, C711S169000
Status: active
Patent Number: 06496917
BACKGROUND
1. Field of Invention
This invention relates generally to superscalar processors, and specifically to improving memory latencies of superscalar processors in multi-processor systems.
2. Description of Related Art
Modern computer systems utilize a hierarchy of memory elements in order to realize an optimum balance between the speed, size, and cost of computer memory. These computer systems typically employ a Dynamic Random Access Memory (DRAM) as primary memory and include a larger, but much slower, secondary memory such as, for instance, a magnetic storage device or Compact Disc Read Only Memory (CD ROM). A small, fast Static Random Access Memory (SRAM) cache memory is typically provided between the central processing unit (CPU) and primary memory. This fast cache memory increases the data bandwidth of the computer system by storing information most frequently needed by the CPU. In this manner, information most frequently requested during execution of a computer program may be rapidly provided to the CPU from the SRAM cache memory, thereby eliminating the need to access the slower primary and secondary memories. Although fast, the SRAM cache memory is very expensive and is therefore typically small to minimize costs.
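The latency benefit of such a cache can be seen with the standard average-access-time arithmetic. The following C sketch is illustrative only; the latency and hit-rate figures are assumptions, not values from this patent:

#include <stdio.h>

int main(void) {
    /* Assumed, illustrative timings: fast SRAM cache vs. slower DRAM. */
    double cache_hit_ns   = 5.0;   /* assumed SRAM cache access time      */
    double dram_access_ns = 60.0;  /* assumed DRAM primary-memory latency */
    double hit_rate       = 0.95;  /* assumed fraction of hits in cache   */

    /* Average access time = hit time + miss rate x miss penalty. */
    double avg_ns = cache_hit_ns + (1.0 - hit_rate) * dram_access_ns;
    printf("with cache: %.1f ns average; without: %.1f ns\n",
           avg_ns, dram_access_ns);
    return 0;
}

Under these assumed figures, the cache cuts the average access time from 60 ns to 8 ns, which is why a small, expensive SRAM in front of DRAM pays for itself.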
To further increase performance, high-end computer systems may employ multiple central processing units (CPUs) operating in parallel to allow for the simultaneous execution of multiple instructions of a computer program.
FIG. 1 illustrates a non-uniform memory architecture 1 having four CPU blocks 10A-10D, each connected to a system bus 11. Referring also to FIG. 2, each CPU block 10 includes a CPU 20, an external or L2 cache 24, and a primary memory 25. The external cache 24 (E$) is typically an SRAM device, and the primary memory 25 is typically a DRAM device. Each CPU 20 includes an external cache controller 21 to interface with its external cache 24, and a primary memory controller 22 to interface with its primary memory 25. Each CPU 20 also includes a bus interface unit 23 to interface with the system bus 11. Although not shown in FIGS. 1 and 2 for simplicity, each CPU 20 includes an internal or L1 cache, which in turn is typically divided into an instruction cache and a data cache. The instruction cache (I$) stores frequently executed instructions, and the data cache (D$) stores frequently used data.
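As a rough software model of the blocks just described (hypothetical C types; the patent defines these elements as hardware circuits, and every name below is an illustrative assumption):

/* Hypothetical C model of one CPU block 10 of FIGS. 1 and 2. */
struct cache       { unsigned size_kb; };     /* I$, D$, or external E$ 24    */
struct cache_ctrl  { unsigned pending; };     /* external cache controller 21 */
struct mem_ctrl    { unsigned queue_depth; }; /* primary memory controller 22 */
struct bus_iface   { int bus_id; };           /* bus interface unit 23        */

struct cpu {                                  /* CPU 20 */
    struct cache      icache;                 /* internal instruction cache (I$) */
    struct cache      dcache;                 /* internal data cache (D$)        */
    struct cache_ctrl ecache_ctrl;            /* controller 21 for E$ 24         */
    struct mem_ctrl   pmem_ctrl;              /* controller 22 for memory 25     */
    struct bus_iface  bus_unit;               /* interface 23 to system bus 11   */
};

struct cpu_block {                            /* one of blocks 10A-10D      */
    struct cpu    cpu;                        /* CPU 20                     */
    struct cache  ecache;                     /* external SRAM cache 24     */
    unsigned long pmem_bytes;                 /* DRAM primary memory 25     */
};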
FIG. 1 shows additional devices connected to system bus 11. A secondary memory device 12 such as, for instance, a hard-disk or tape drive, provides additional memory. A monitor 13 provides users a graphical interface with system 1. CPU blocks 10A-10D are connected to a network 14 (e.g., a local area network, a wide area network, a virtual private network, or the Internet) via system bus 11.
During execution of a computer program, the computer program instructs the various CPUs 20 of system 1 to fetch instructions by incrementing program counters within the various CPUs 20 (program counters not shown for simplicity). In response thereto, each CPU 20 fetches instructions identified by the computer program. If an instruction requests data, an address request specifying the location of that data is issued. The corresponding CPU 20 first searches its internal cache for the data. If the specified data is found in the L1 cache, the data is immediately provided to the CPU 20 for processing.
If, on the other hand, the specified data is not found in the internal cache, the external cache 24 is then searched. If the specified data is found in the external cache 24, the data is returned to the CPU 20 for processing. If the specified data is not in the external cache 24, the address request is forwarded to the system bus 11, which in turn provides the address request to the primary memory 25 of each of the various CPUs 20 via respective primary memory controllers 22.
When running a computer program on a multiprocessor system such as system 1 of FIGS. 1 and 2, instructions which write to or read from a specified address may be executed by different CPUs. As a result, it is necessary to monitor instructions assigned to the various CPUs in order to maintain data coherency. For example, if a first instruction executed by CPU block 10A modifies data at a specified address, and a subsequent instruction executed by CPU block 10B reads data at the specified address, the first instruction must be executed before the second instruction since the data requested by the second instruction is modified by the first instruction.
Data coherency is typically maintained in a multiprocessor system such as the system shown in FIGS. 1 and 2 by first issuing all primary memory address requests to the system bus 11, irrespective of whether a particular address request is to the executing CPU's own primary memory (a local request) or to another CPU's primary memory (a remote request). This ensures that instructions are executed by the various CPUs in the order in which they were issued to the system bus 11, an order that presumably mirrors the instruction order of the computer program. Thus, for instance, in response to an external cache miss, a CPU 20 forwards the primary memory address request to the system bus 11. Once issued on the system bus 11, the address request is available to all CPUs 20. The CPU 20 that issued the address request retrieves the address request back from the system bus 11, and thereafter searches its primary memory 25 for the specified data. The other CPUs 20 also monitor the address request issued on the system bus 11 to generate well-known snoop information for the requesting CPU. Snoop information maintains cache consistency between the various CPUs 20 by indicating whether data specified by the address request has been modified while stored in the cache of another CPU 20.
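The bus-ordered request-and-snoop flow described above might be modeled as follows. This is a minimal sketch under assumed names; real snooping is a hardware protocol, and the dirty-line check here is a stub:

#include <stdbool.h>
#include <stdio.h>

#define NUM_CPUS 4

/* Stub: a real implementation would consult the CPU's cache tags and
 * coherency state for this address. */
static bool holds_modified_copy(int cpu, unsigned long addr) {
    (void)cpu; (void)addr;
    return false;
}

/* Issuing a request on system bus 11 makes it visible to every CPU 20:
 * the requester then reads its primary memory 25, while the other CPUs
 * snoop the address and report whether they hold a modified copy. */
static void issue_bus_request(int requester, unsigned long addr) {
    bool modified_elsewhere = false;
    for (int cpu = 0; cpu < NUM_CPUS; cpu++) {
        if (cpu == requester)
            continue;
        if (holds_modified_copy(cpu, addr))  /* snoop by non-requesters */
            modified_elsewhere = true;
    }
    printf("addr 0x%lx snoop: %s\n", addr,
           modified_elsewhere ? "modified in another cache" : "clean");
}

int main(void) {
    issue_bus_request(0, 0x2000UL);
    return 0;
}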
Routing an address request to primary memory 25 via the system bus 11 in response to a cache miss advantageously maintains proper data coherency in a multiprocessor system. However, routing an address request from a CPU's cache to the CPU's own primary memory 25 via the system bus 11 first requires access to the system bus 11. The multiple connections to the system bus 11 result in a relatively large amount of traffic on the system bus 11, which in turn may cause significant delays in arbitrating access to the system bus 11. These arbitration delays undesirably increase the total latency of primary memory 25. Since primary memory access speeds are not increasing as quickly as CPU processing speeds, it is becoming increasingly important to reduce primary memory latencies in order to maximize CPU performance. Indeed, it would be highly desirable to improve primary memory latencies in a multiprocessor computer system while preserving data coherency.
SUMMARY
A method is disclosed that reduces memory latencies in a multiprocessor computer system over the prior art. A multiprocessor system includes a plurality of central processing units (CPUs) connected to one another by a system bus. Each CPU includes a cache controller to communicate with its cache, and a primary memory controller to communicate with its primary memory. In accordance with the present invention, when there is a cache miss in a CPU, the cache controller routes an address request for primary memory directly to the primary memory via the CPU as a speculative request, and also issues the address request to the system bus to maintain data coherency. The speculative request is queued in the primary memory controller, and thereafter retrieves speculative data from a specified primary memory address. The CPU monitors the system bus for requests to the specified primary memory address. If a subsequent transaction requesting the specified data is the read request that was issued on the system bus in response to the cache miss, the speculative request and any data retrieved thereby are validated and become non-speculative. If, on the other hand, the subsequent transaction is not that read request, the speculative request and any data retrieved thereby are invalidated.
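A minimal sketch of the validate-or-cancel decision described in this summary, assuming hypothetical types and helper names (the patent describes hardware request queues, not this code):

/* On a cache miss, the address request is sent two ways at once: directly
 * to the local primary memory (speculative) and onto system bus 11 (for
 * ordering and coherency). The speculative data may be used only if the
 * first bus transaction for that address is the CPU's own read request.
 * All names here are illustrative assumptions. */

#include <stdio.h>

typedef struct {
    unsigned long addr;   /* specified primary memory address          */
    unsigned long data;   /* data fetched speculatively from memory 25 */
    int           valid;  /* 1 once the bus confirms the speculation   */
} spec_req_t;

/* Called for each bus transaction observed at req->addr. */
static void on_bus_transaction(spec_req_t *req, int is_own_read) {
    if (is_own_read)
        req->valid = 1;   /* our read arrived first: speculation holds */
    else
        req->valid = 0;   /* another access was ordered ahead: discard
                             the speculatively retrieved data          */
}

int main(void) {
    spec_req_t req = { 0x3000UL, 42UL, 0 };
    on_bus_transaction(&req, 1);
    printf("speculative data %s\n", req.valid ? "validated" : "discarded");
    return 0;
}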
Inventors: Sutikshan Bhutani, Rajasekhar Cherabuddi, Meera Kasinathan, Brian J. McGee, Kevin B. Normoyle
Examiner: Reba I. Elmore
Attorney: William L. Paradice III
Assignee: Sun Microsystems Inc.