Title: Dynamic hardware and software performance optimizations for...
Classification: Electrical computers and digital processing systems: memory – Storage accessing and control – Hierarchical memories
Type: Reexamination Certificate (active)
Filed: 2001-10-16
Issued: 2004-03-09
Examiner: Nguyen, J (Department: 2187)
U.S. Classes: C711S118000, C711S119000, C711S120000, C711S133000, C711S141000, C711S142000, C711S144000, C711S143000
Patent Number: 06704844
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Technical Field
The present invention relates generally to data processing systems and, in particular, to processor-cache operations within a multiprocessor data-processing system. Still more particularly, the present invention relates to SMP system optimization via efficient cache coherency operations.
2. Description of the Prior Art
A data-processing system typically includes a processor coupled to a variety of storage devices arranged in a hierarchical manner. In addition to a main memory, a commonly employed storage device in the hierarchy is a high-speed memory known as a cache memory (or cache). A cache speeds up the apparent access times of the relatively slower main memory by retaining the data or instructions that the processor is most likely to access again, and making the data or instructions available to the processor at a much lower latency. As such, caches enable relatively fast access to a subset of data and/or instructions that were recently transferred from the main memory to the processor, and thus improve the overall speed of the data-processing system.
Most contemporary high-performance data processing system architectures include multiple levels of cache memory within the memory hierarchy. Successive cache levels typically have progressively longer access latencies: smaller, faster caches are employed at levels within the storage hierarchy closer to the processor (or processors), while larger, slower caches are employed at levels closer to system memory.
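To make the latency trade-off concrete, here is a minimal sketch (not from the patent) of a load falling through successively larger, slower cache levels; the latency figures and data-structure choices are invented for illustration:

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Illustrative only: a cache level maps addresses to data and has a nominal
// access latency that grows with distance from the processor.
struct CacheLevel {
    int latency_cycles;
    std::unordered_map<uint64_t, uint32_t> lines;  // address -> data (tags elided)
};

// Probe L1, then L2, then L3, ..., finally main memory; return the data plus
// the accumulated latency of the search.
std::pair<uint32_t, int> load(std::vector<CacheLevel>& hierarchy,
                              std::unordered_map<uint64_t, uint32_t>& memory,
                              uint64_t addr) {
    int latency = 0;
    for (auto& level : hierarchy) {
        latency += level.latency_cycles;
        auto it = level.lines.find(addr);
        if (it != level.lines.end())
            return {it->second, latency};          // hit: stop at this level
    }
    return {memory[addr], latency + 200};          // assumed main-memory latency
}
```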
In a conventional symmetric multiprocessor (SMP) data processing system, all of the processors are generally identical, insofar as the processors all utilize common instruction sets and communication protocols, have similar hardware architectures, and are generally provided with similar memory hierarchies. For example, a conventional SMP data processing system, as illustrated in FIG. 1A, may comprise a system memory 107, a plurality of processing elements 101A-101D that each include a processor and one (or more) level(s) of cache memory 103A-103D, and a system bus 105 coupling the processing elements (processors) 101A-101D to each other and to the system memory 107. Many such systems include at least one level of cache memory shared between two or more processors. Additionally, a “shared” cache line 109 may exist in each cache memory 103A-103D. To obtain valid execution results in an SMP data processing system, it is important to maintain a coherent memory hierarchy, that is, to provide a single view of the contents of memory to all of the processors.
A coherent memory hierarchy is maintained through the use of a selected memory coherency protocol, such as the MESI protocol. In the MESI protocol, an indication of a coherency state is stored in association with each cache line of at least all upper level (cache) memories. Each cache line can have one of four states, “M” (Modified), “E” (Exclusive), “S” (Shared) or “I” (Invalid), which can be encoded by two bits in the cache directory.
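As a sketch of that encoding (the specific bit assignments below are an assumption; the text only says that two directory bits suffice for the four states):

```cpp
#include <cstdint>

// The four MESI coherency states, packed into two directory bits.
// Any two-bit assignment works; this particular one is arbitrary.
enum class Mesi : uint8_t {
    Invalid   = 0b00,  // addressed data not resident (or not valid) in this cache
    Shared    = 0b01,  // valid here and in at least one other cache; matches memory
    Exclusive = 0b10,  // valid only in this cache; matches memory
    Modified  = 0b11,  // valid only in this cache; dirty with respect to memory
};
```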
FIG. 2 illustrates the MESI protocol and its state transition features. Under the MESI protocol, each cache entry (e.g., a 32-byte sector) has two additional bits which indicate the state of the entry, out of the four possible states. Depending upon the initial state of the entry and the type of access sought by the requesting processor, the state may be changed, and a particular state is set for the entry in the requesting processor's cache. For example, when data in a cache line is in the Modified (M) state, the addressed data is valid only in the cache having the modified cache line, and the modified value has not been written back to system memory. When a cache line is in the Exclusive state, the corresponding data is present only in the noted cache, and is consistent with system memory. If a cache line is in the Shared state, the data is valid in that cache and in at least one other cache, with all of the shared data being consistent with system memory. Finally, when a cache line is in the Invalid state, the addressed data is not resident in the cache. As seen in FIG. 2 and known in the art, the state of the cache line transitions between the various MESI states depending upon particular bus or processor transactions.
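The transitions of FIG. 2 are not reproduced here, but a simplified transition function conveys the flavor; the event set and the exact transitions below are assumptions, and the `Mesi` enum is reused from the sketch above:

```cpp
// Simplified processor/bus events; real protocols distinguish many more.
enum class Event { ProcRead, ProcWrite, SnoopBusRead, SnoopBusWrite };

Mesi next_state(Mesi current, Event e, bool others_have_copy) {
    switch (current) {
    case Mesi::Invalid:
        if (e == Event::ProcRead)  return others_have_copy ? Mesi::Shared
                                                           : Mesi::Exclusive;
        if (e == Event::ProcWrite) return Mesi::Modified;     // after invalidating others
        return current;
    case Mesi::Shared:
        if (e == Event::ProcWrite)     return Mesi::Modified; // must gain ownership first
        if (e == Event::SnoopBusWrite) return Mesi::Invalid;  // another writer takes over
        return current;
    case Mesi::Exclusive:
        if (e == Event::ProcWrite)     return Mesi::Modified; // silent upgrade, no bus work
        if (e == Event::SnoopBusRead)  return Mesi::Shared;
        if (e == Event::SnoopBusWrite) return Mesi::Invalid;
        return current;
    case Mesi::Modified:
        if (e == Event::SnoopBusRead)  return Mesi::Shared;   // after writing data back
        if (e == Event::SnoopBusWrite) return Mesi::Invalid;  // after flushing the line
        return current;
    }
    return current;  // unreachable with a well-formed state
}
```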
There are a number of protocols and techniques for achieving cache coherence that are known to those skilled in the art. At the heart of all these mechanisms for maintaining coherency is the requirement that the protocols allow only one processor to have “permission” (or a lock) that allows a write to a given memory location (cache block) at any given point in time. As a consequence of this requirement, whenever a processor (or processing component) attempts to write to a memory location, the processor must first inform all other processing components of its desire to write the cache line, and invalidate all other processing components' copies of that cache line (at the same address).
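A minimal sketch of that single-writer rule, assuming a toy cache model (all names below are invented for illustration):

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Line { uint32_t data; bool writable; };
using ToyCache = std::unordered_map<uint64_t, Line>;  // address -> cached line

// Before storing, the writer invalidates every other cached copy of the line,
// so that at any instant at most one cache holds write permission for it.
void coherent_write(std::vector<ToyCache>& caches, size_t writer,
                    uint64_t addr, uint32_t value) {
    for (size_t p = 0; p < caches.size(); ++p)   // inform all other components
        if (p != writer)
            caches[p].erase(addr);               // their copies become invalid
    caches[writer][addr] = {value, /*writable=*/true};
}
```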
To implement cache coherency in a system, the processors communicate over a common generalized interconnect (i.e., system bus 105). The processors pass messages over the interconnect indicating their desire to read or write memory locations. When an operation is placed on the interconnect, all of the other processors “snoop” (monitor) this operation and decide if the state of their caches can allow the requested operation to proceed and, if so, under what conditions. There are several bus transactions that require snooping and follow-up action to honor the bus transactions and maintain memory coherency. The snooping operation is triggered by the receipt of a qualified snoop request, generated by the assertion of certain bus signals. Instruction processing is interrupted only when a snoop hit occurs and the snoop state machine determines that an additional cache snoop is required to resolve the coherency of the offended sector.
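For instance, a snooping cache's decision for a bus read might be sketched as follows (the response names are assumptions; `Mesi` is the enum from the earlier sketch):

```cpp
// What a snooper answers when it observes another processor's bus read.
enum class SnoopResponse {
    Clean,      // no copy here, or copy matches memory: operation may proceed
    SharedHit,  // copy present and clean: requester must load the line as Shared
    Retry,      // copy modified here: requester must retry after the write-back
};

SnoopResponse snoop_bus_read(Mesi local_state) {
    switch (local_state) {
    case Mesi::Modified:  return SnoopResponse::Retry;      // write back first
    case Mesi::Exclusive:
    case Mesi::Shared:    return SnoopResponse::SharedHit;  // data also lives here
    case Mesi::Invalid:   return SnoopResponse::Clean;      // not involved
    }
    return SnoopResponse::Clean;
}
```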
This communication is necessary because, in systems with caches, the most recent valid copy of a given block of memory may have moved from the system memory to one or more of the caches in the system (as mentioned above). If a processor attempts to access a memory location not present within its cache hierarchy, the correct version of the block, which contains the actual (current) value for the memory location, may either be in the system memory or in one or more of the caches in another processing unit. If the correct version is in one or more of the other caches in the system, it is necessary to obtain the correct value from the cache(s) in the system instead of system memory.
For example, with reference to FIG. 1A, a read transaction issued against cache line 109 by P0 (processor 101A) and the subsequent coherency operations would evolve as follows. P0 first searches its own L1 cache 103A. If the cache line is not present in the L1 cache 103A, the request is forwarded to the L2 cache, then the L3 cache, and so on until the request is presented on the generalized interconnect (system bus 105) to be serviced by one of the other processors or the system memory. Once an operation has been placed on the generalized interconnect, all other processing units P1-P3 snoop the operation and determine if the block is present in their caches. If a given processing unit has the block of data requested by P0 in its L1 cache, and that data is modified, then by the principle of inclusion the L2 cache and any lower level caches also have copies of the block (however, their copies are stale, since the copy in the processor's cache is modified). Therefore, when the lowest level cache (e.g., L3) of the processing unit snoops the read instruction, it will determine that the block requested is present and modified in a higher level cache. When this occurs, the L3 cache places a message on the generalized interconnect informing the requesting processing unit that it must “retry” its operation at a later time, because the actual value of the memory location is in the L1 cache a
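The requester's side of that retry handshake might look like the following sketch; the bus primitive `issue_read` and its replies are invented stand-ins for the generalized interconnect:

```cpp
#include <cstdint>

enum class BusReply { Data, Retry };  // hypothetical interconnect responses

// Stub standing in for the generalized interconnect; a real bus would
// broadcast the address and combine the snoop responses of P1-P3.
BusReply issue_read(uint64_t addr, uint32_t& out) {
    out = 0;                // pretend memory supplied the value
    return BusReply::Data;  // pretend no snooper demanded a retry
}

// P0's view of the FIG. 1A example: reissue the read until no snooper asks
// for a retry, i.e., until the holder of the modified copy has pushed the
// current value out of its L1 and back toward memory.
uint32_t coherent_read(uint64_t addr) {
    uint32_t value = 0;
    while (issue_read(addr, value) == BusReply::Retry) {
        // Wait/back off; the owning cache is writing the modified line back.
    }
    return value;
}
```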
Inventors: Arimilli, Ravi Kumar; Guthrie, Guy Lynn; Starke, William J.; Williams, Derek Edward
Examiner: Nguyen, J
Agent: Salys, Casimer K.