Electrical computers and digital processing systems: memory – Storage accessing and control – Hierarchical memories
Reexamination Certificate
1997-11-17
2003-10-14
Gossage, Glenn (Department: 2187)
Electrical computers and digital processing systems: memory
Storage accessing and control
Hierarchical memories
C711S119000, C711S144000, C711S145000
Reexamination Certificate
active
06633958
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to memory access control in multiprocessors, and more particularly to a directory structure for maintaining cache coherency.
2. Background Information
In a shared-memory multiprocessor system with per-processor caches, it is desirable for hardware to keep the caches coherent. This means that when one or more processors have a copy of a block or line of memory and one of the processors performs a write to that line, the cached copies must either be updated or invalidated.
There are two basic mechanisms for doing this. One is to have all processors “snoop” on all write operations in the system and invalidate or update their own copies when appropriate. This is called “snoopy” coherence or “snooping caches”, and is generally done in shared-bus multiprocessors.
The other alternative is to use directories that are associated with main memory. In large systems, memory, and thus the directories, may be physically distributed among the machine. The directory associated with a region of memory keeps track of which processors, if any, have copies of memory lines in that region. When a write occurs to a line, the directory is consulted, and updates or invalidates are sent to any processors that have a copy of the line. In this way, processors are relieved from having to snoop on all memory write traffic.
The basic problem with directories is that the required storage does not scale well. The canonical directory structure associates a directory entry with every line of main memory. Each directory entry consists of some state, along with a bit vector (one bit per processor in the system) indicating which processor(s) currently have a copy of the line. As the number of processors is increased, the number of lines for which directory information is needed grows linearly. In addition, the size of the directory entries also grows linearly. The net result is that the total amount of storage needed for the directories grows as the square of the number of processors, becoming prohibitively expensive for large systems.
There are a number of ways to reduce directory overhead. One way is to reduce the number of directory entries by noting that, since caches are much smaller than main memory, most memory lines will not be cached by any processor at any given time. Thus, the size of the directories are reduced, and the directory is managed as a cache. Any line for which a directory entry cannot be found is assumed to be in the uncached state.
Another way to reduce directory overhead is to limit the size of the directory entries such that an entry cannot represent any arbitrary set of processors in the system. The system is then either prohibited from allowing non-representable sets of processors to cache a line concurrently (by, say, invalidating the copies of certain processors when other processors obtain a copy of the line), or, more preferably, when a non-representable set of sharing processors occurs, the directory entry is set to represent some superset of the sharing processors. Then when the line is written, an invalidation or update message is sent to the superset of processors caching the line.
(The two approaches are orthogonal. That is, use of both approaches in tandem should result in a greater reduction in directory overhead than can be achieved with either one by itself.)
A mechanism for limiting the size of the directory entries is discussed next. In the following discussion, we will for convenience assume an invalidate-based coherence protocol, although the approach is equally applicable to an update-based protocol. A goal of a directory structure for a large multiprocessor system is to use a modest number of bits in a directory entry, yet minimize the number of “spurious” invalidation messages that must be sent when a line is written. That is, keep the superset as close to the size of the actual set of sharers as possible.
At one end of the spectrum is a full broadcast mechanism. In this scheme, as soon as a line becomes cached by any processor (or perhaps only when it is cached by more than one processor), the state of the corresponding directory indicate is set to indicate that a broadcast is necessary. When the line is written, invalidations are sent to all processors in the system. This mechanism minimizes the number of bits needed in the directory entry, but maximizes spurious invalidations. The performance of this scheme is very poor for large systems.
At the other end of the spectrum is the full bit-vector mechanism described above, in which a directory entry includes a bit for each processor in the system. This maximizes directory storage overhead, but eliminates spurious invalidations. The storage overhead for this scheme is unacceptable for large systems.
A reasonable middle ground is a “coarse-vector” directory structure like the one used in the Origin “™” 2000 manufactured by Silicon Graphics Inc (SGI) of Mountain View, Calif. The directory structure in the Origin “™” 2000 includes a bit vector of size v in the directory entries (where v=32). Each bit represents one or more processor nodes in the system; for systems with thirty-two or fewer nodes, this size bit vector acts like a full bit vector. For larger numbers of nodes, however, the vector can become “coarse”. When the set of processor nodes sharing a line is contained within an aligned block of consecutive processor nodes, then the bit vector can still be used as a full bit vector, with another small field in the directory entry specifying the block of processor nodes the vector represents. Processor nodes will typically contain one or more processors; in the Origin “™” 2000, each processor node includes two processors.
When the set of processor nodes expands beyond an aligned block of v processor nodes, however, the meaning of the bits in the vector is changed (this is recorded in the state information in the entry). For N-processor-node systems, each bit in the vector now represents N/v processor nodes. For example, in a 512-processor node system with a 32-bit vector, each bit in the vector represents sixteen processor nodes. For every processor node caching the line, the bit representing the set of processor nodes containing that processor node would be set. When the line is written, for each bit that is set in the coarse vector, invalidation messages are sent to the corresponding set of N/v processor nodes. In most cases, this will cause invalidations to be sent to some processor nodes that do not have a copy of the line (spurious invalidates). However, the number of spurious invalidates will generally be much fewer than would be sent with the full broadcast mechanism.
Such an approach is effective in reducing the size of each directory entry. Such an approach does, however, continue to scale linearly as the number of processors in the system increase. What is needed is a better way of compacting directory entries.
SUMMARY OF THE INVENTION
The present invention is a cache coherence system and method for use in a multiprocessor computer system having a plurality of processor nodes, a memory and an interconnect network connecting the plurality of processor nodes to the memory. Each processor node includes one or more processors. The memory includes a plurality of lines and a cache coherence directory structure having a plurality of directory structure entries. Each of the directory structure entries is associated with one of the plurality of lines and each directory structure entry includes processor pointer information, expressed as a set of bit vectors, indicating the processor nodes that have cached copies of lines in memory.
According to another aspect of the present invention, a method is described for maintaining cache coherency across a computer system having a plurality of processor nodes, including a first and a second processor node, wherein each of the plurality of processor nodes includes a cache and at least one processor. The method comprises the steps of assigning a processor number to each of the plurality o
Passint Randal S.
Scott Steven L.
Gossage Glenn
Silicon Graphics Inc.
LandOfFree
Multiprocessor computer system and method for maintaining... does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Multiprocessor computer system and method for maintaining..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Multiprocessor computer system and method for maintaining... will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-3164580