Title: State-based allocation and replacement for improved hit...
Type: Reexamination Certificate
Filed: 2001-03-07
Issued: 2004-11-30
Examiner: Vital, Pierre M. (Department: 2188)
Classification: Electrical computers and digital processing systems: memory – Storage accessing and control – Hierarchical memories
U.S. Classes: C711S133000, C711S144000, C711S145000
Status: active
Patent number: 06826651
ABSTRACT:
FIELD OF THE INVENTION
The present invention relates to the efficient processing of memory requests in cache coherent distributed shared memory multiprocessor systems. More specifically, the present invention leads to improved processing speed and reduced size of directory caches in the coherence controller of shared memory multiprocessor systems by utilizing an efficient directory cache replacement policy.
BACKGROUND OF THE INVENTION
Conventional computer systems often include on-chip or off-chip cache memories that are used with processors to speed up accesses to system memory. In a shared memory multiprocessor system, more than one processor, sometimes referred to as a compute node, can store a copy of the same memory locations (or lines) in their respective cache memories. An internode cache coherence mechanism is required to maintain consistency among the multiple cached copies of the same memory line across all nodes that store the line. In small, bus-based multiprocessor systems, the internode cache coherence mechanism is usually implemented as a part of the cache controllers using a snoopy coherence protocol. The snoopy protocol cannot be used in large systems that are connected through a non-broadcast interconnection network since snoopy protocols may require all requests to be broadcast to all nodes. As a result, these systems use a directory-based protocol to maintain cache coherence, where a memory directory in each node is associated with the main memory. The coherence controllers of cache coherent distributed shared memory multiprocessor systems use these memory directories to maintain a record of the state of memory lines in the caches of the various processing nodes of the multiprocessor system. This state information includes data indicating which cache(s) has a copy of a given line or whether the line has been modified in a cache(s).
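For illustration only, the following minimal C sketch shows one conventional way such per-line directory state can be represented: a state field plus a presence bit vector recording which nodes hold a copy. The three-state encoding, the 64-node limit, and the helper name dir_add_sharer are assumptions for this sketch, not the format described in this patent.

```c
/* Illustrative full-map directory entry: a small system with up to 64
 * nodes and a simple three-state protocol.  Field names and widths are
 * assumptions, not the patented format. */
#include <stdint.h>

enum dir_state {
    DIR_UNCACHED,   /* no node holds the line               */
    DIR_SHARED,     /* one or more nodes hold clean copies  */
    DIR_MODIFIED    /* exactly one node holds a dirty copy  */
};

struct dir_entry {
    enum dir_state state;
    uint64_t       sharers;   /* bit i set => node i caches the line */
};

/* Record that node `node` obtained a shared (read-only) copy. */
static inline void dir_add_sharer(struct dir_entry *e, unsigned node)
{
    e->sharers |= (uint64_t)1 << node;
    if (e->state == DIR_UNCACHED)
        e->state = DIR_SHARED;
}
```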
FIG. 1 illustrates a representation of an example of a shared memory multi-node computing system. On a system area network 100, one or more nodes exist. The system can include one or more compute nodes 110, memory nodes 112, and/or compute/memory nodes 114. Each compute node 110 includes one or more processors with associated caches 120, one or more shared/remote caches 130 (optional), and one or more coherence controllers 160. Caches 120 and 130 can be referred to collectively as “data cache.” Each memory node 112 contains one or more main memory modules 140, at least one memory directory 150 which stores information about the contents of its associated main memory 140, at least one coherence controller 160, at least one directory cache (DC) 170, as disclosed in the parent application, and may also include one or more I/O devices (not shown). A compute/memory node 114 includes the elements of a compute node and a memory node.
Memory directories such as 150 may be organized as “full map” memory directories, where the state information on every single memory line is stored by mapping each memory line to a unique location in the directory. That is, there is a one-to-one state mapping between a main memory line and a memory directory entry. As a result, when the size of main memory increases, the memory directory size must also increase. If the memory directory is implemented as relatively fast static RAM, tracking the size of main memory becomes prohibitively expensive. If the memory directory is implemented using slow static RAM or DRAM, higher cost is avoided, but a penalty is incurred in overall system performance due to the slower chips. In fact, each directory access in such slower implementations will take approximately 5-20 controller cycles to complete.
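As a rough illustration of why a full map directory must track the size of main memory, the C sketch below indexes the directory directly by memory line number and computes the directory's total size. The 64-byte line size and 8-byte entry size are assumed values chosen only for the example.

```c
/* Full-map indexing sketch: every memory line has its own directory
 * entry, so the directory index is simply the line number and the
 * directory grows linearly with main memory.  Sizes are assumptions. */
#include <stdint.h>

#define LINE_SIZE       64u    /* bytes per memory line (assumed)      */
#define DIR_ENTRY_SIZE   8u    /* bytes per directory entry (assumed)  */

static inline uint64_t fullmap_index(uint64_t paddr)
{
    return paddr / LINE_SIZE;                      /* one entry per line */
}

static inline uint64_t fullmap_dir_bytes(uint64_t mem_bytes)
{
    return (mem_bytes / LINE_SIZE) * DIR_ENTRY_SIZE;  /* linear growth */
}
```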
In order to overcome the problems with full map memory directories, “sparse” memory directories have been used instead. In a sparse memory directory multiple memory lines are mapped to a single location in the sparse directory, although state information stored in the mapped location may be valid only for a subset of the memory lines mapping to that location. Due to its smaller size, a sparse directory can be implemented in an economical fashion using fast static RAMs. However, when there is contention among memory lines for the same sparse directory entry field, i.e., when it is desired to store directory information for a given line of memory in a location that already contains current directory information for another line of memory, the state information of the line already stored in the memory directory must be replaced. Since there is no backup state information, when a line is replaced in the sparse directory, all the caches in the overall system having a copy of that line must be asked to invalidate their copies. This incomplete directory information leads to both coherence protocol complexity and performance loss.
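The following C sketch illustrates, under assumed parameters, how a direct-mapped sparse directory forces the displaced line's cached copies to be invalidated when two memory lines contend for the same slot. The slot count, the field layout, and the invalidate_all_sharers hook are hypothetical and serve only to make the replacement cost concrete.

```c
/* Sparse-directory allocation sketch: many memory lines map to one
 * slot, so a tag records which line the slot currently tracks. */
#include <stdbool.h>
#include <stdint.h>

#define SPARSE_ENTRIES  (1u << 16)   /* far fewer slots than memory lines */

struct sparse_slot {
    bool     valid;
    uint64_t tag;        /* which memory line currently owns the slot */
    uint64_t sharers;    /* presence bit vector                       */
};

/* Hypothetical hook: tell every sharer of `line` to invalidate its copy. */
extern void invalidate_all_sharers(uint64_t line, uint64_t sharers);

/* Claim the slot for `line`; on a conflict the displaced line's cached
 * copies must be invalidated system-wide, since no backup state exists. */
static void sparse_allocate(struct sparse_slot dir[], uint64_t line)
{
    struct sparse_slot *s = &dir[line % SPARSE_ENTRIES];

    if (s->valid && s->tag != line)
        invalidate_all_sharers(s->tag, s->sharers);  /* forced replacement */

    s->valid   = true;
    s->tag     = line;
    s->sharers = 0;
}
```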
Since directory lookup speed and throughput have been shown to have an important impact on the overall performance of distributed shared memory multiprocessor systems, directory caches, as disclosed in the parent application, having records corresponding to a subset of the memory directory entries can be used. Such smaller and faster directory caches associated with the coherence controllers can be used to improve the speed of directory lookup by keeping the copies of a subset of the memory directory entries in faster on-chip or off-chip memory.
FIG. 2 shows a representation of a directory cache (“DC”) system 200 which caches the state information for a subset of memory lines 250 of the main memory 220. While, in this representation, the memory directory 210 is illustrated as a full map memory directory, the memory directory could also be implemented as a sparse directory, or other directory scheme. The DC 200 is organized into DC lines 230 and stores state information from only a subset of the directory entries 240.
FIG. 3 shows an organization of a DC, such as DC 200 of FIG. 2. The DC 200 stores s sets 310. Each set consists of w DC lines 320 corresponding to w ways of set associativity. Each DC line 320 contains a validity indicator 330, a tag 340, and e memory directory entries 350, each containing state information 360 for a line of the associated memory, where w and e are integers greater than or equal to 1 and s is typically a power of 2. The validity indicator 330, which can be a single bit or multiple bits, indicates that the DC line currently holds valid information. The tag 340 maps the directory entries 350 in the cache line back to their locations in memory directory 210.
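A minimal C sketch of the lookup implied by this organization follows. The particular values of s, w, and e, the index/tag split, and the function name dc_lookup are illustrative assumptions rather than the layout claimed here.

```c
/* DC lookup sketch for FIG. 3: s sets of w ways, each DC line holding
 * a validity indicator, a tag, and e consecutive directory entries. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DC_SETS          1024u   /* s: assumed, a power of two          */
#define DC_WAYS             4u   /* w: degree of set associativity      */
#define ENTRIES_PER_LINE    4u   /* e: directory entries per DC line    */

struct dc_line {
    bool     valid;                        /* validity indicator (330) */
    uint64_t tag;                          /* tag (340)                */
    uint64_t entry[ENTRIES_PER_LINE];      /* directory entries (350)  */
};

static struct dc_line dc[DC_SETS][DC_WAYS];

/* Look up the cached copy of memory-directory entry `dir_idx`;
 * returns a pointer to the state word on a hit, NULL on a miss. */
static uint64_t *dc_lookup(uint64_t dir_idx)
{
    uint64_t line = dir_idx / ENTRIES_PER_LINE;  /* group of e entries */
    uint64_t set  = line % DC_SETS;
    uint64_t tag  = line / DC_SETS;

    for (unsigned way = 0; way < DC_WAYS; way++)
        if (dc[set][way].valid && dc[set][way].tag == tag)
            return &dc[set][way].entry[dir_idx % ENTRIES_PER_LINE];

    return NULL;   /* miss: fetch from the memory directory */
}
```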
Improved performance for the DC, and accordingly for the entire multiprocessor system, can be achieved if the subset of memory directory entries cached in the DC consists of the most frequently used entries; that is, when the hit ratio of the DC is high. The hit ratio of a cache is mainly dependent on the allocation and replacement policies of such a cache.
In conventional caches, one widely used allocation policy caches each and every memory line (or each memory directory entry or set of entries in the case of directory caches) that is referenced and misses in the cache. A less common policy is to designate parts of memory (or the memory directory in the case of directory caches) as uncacheable, so that uncacheable lines do not compete for limited cache space (a.k.a. cache pollution) with cacheable lines that are more likely to benefit from caching, as judged by the programmer. The problem with this latter policy, if applied to directory caches, is that it requires operating system support and programmer annotations, which are undesirable as they require significant extra programmer effort and, at times, may be impossible to implement.
Two common cache replacement policies are Least Recently Used (LRU) and random replacement. LRU is based on the heuristic that memory lines (or memory directory entries in the case of directory caches) that have not been used recently are less likely to be accessed in the near future than others. The random policy, as implied by its name, replaces entries randomly.
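For concreteness, the C sketch below shows one way each policy can pick a victim within a single w-way set. The per-way age counters for LRU and the use of rand() for the random policy are illustrative assumptions; neither sketch represents the state-based policy of the present invention.

```c
/* Victim selection sketches for one w-way set. */
#include <stdlib.h>

#define DC_WAYS 4u

/* LRU: victimize the way that has gone longest without a reference. */
static unsigned pick_victim_lru(const unsigned age[DC_WAYS])
{
    unsigned victim = 0;
    for (unsigned way = 1; way < DC_WAYS; way++)
        if (age[way] > age[victim])   /* larger age => older reference */
            victim = way;
    return victim;
}

/* Random: victimize any way with (roughly) equal probability. */
static unsigned pick_victim_random(void)
{
    return (unsigned)(rand() % DC_WAYS);
}
```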
Inventors: Michael, Maged M.; Nanda, Ashwini; Smith, Thomas Basil, III
Attorney/Agent: F. Chau & Associates LLC
Assignee: International Business Machines Corporation
Examiner: Vital, Pierre M.