Multi-processor system with proactive speculative data transfer
Patent number: 06704842
Type: Reexamination Certificate (active)
Filed: 2000-04-12
Issued: 2004-03-09
Examiner: Kim, Matthew (Department: 2186)
Classification: Electrical computers and digital processing systems: memory – Storage accessing and control – Hierarchical memories
Other classes: C711S143000, C711S146000, C711S147000, C711S148000, C711S213000, C711S204000
ABSTRACT:
TECHNICAL FIELD
The present invention relates generally to high-performance parallel multi-processor computer systems and more particularly to a speculative recall and/or forwarding method to accelerate overall data transfer between processor caches in cache-coherent multi-processor systems.
BACKGROUND ART
Many high-performance parallel multi-processor computer systems are built as a number of nodes interconnected by a general interconnection network (e.g., crossbar and hypercube), where each node contains a subset of the processors and memory in the system. While the memory in the system is distributed, several of these systems (called NUMA systems for Non-Uniform Memory Architecture) support a shared memory abstraction where all the memory in the system appears as a large memory common to all processors in the system. To support high performance, these systems typically allow processors in various nodes to maintain copies of memory data in their local caches. Since multiple processors can cache the same data, these systems must incorporate a cache coherence mechanism to keep the copies consistent, or coherent. These cache-coherent systems are referred to as ccNUMA systems and examples are DASH and FLASH from Stanford University, ORIGIN from Silicon Graphics, STING from Sequent Computers, and NUMAL from Data General.
Coherence is maintained in ccNUMA systems using a directory-based coherence protocol. With coherence implemented in hardware, special hardware coherence controllers maintain the coherence directory and execute the coherence protocol. To support better performance, the coherence protocol is usually distributed among the nodes. With current solutions, a coherence controller is associated with each memory unit and manages the coherence of the data mapped to that memory unit. Each line of memory (typically a portion of memory tens of bytes in size) is assigned a home node, which manages the sharing of that memory line and guarantees its coherence.
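As an illustration of the home-node assignment described above, the sketch below interleaves memory lines across nodes by physical address. The line size, node count, and function name are assumptions for illustration; the patent text does not prescribe a particular mapping.

    #include <stdint.h>

    #define LINE_SIZE_BYTES 64   /* assumed; the text says a line is "tens of bytes" */
    #define NUM_NODES       16   /* assumed machine size */

    /* Interleave memory lines across nodes: each line of physical memory
       gets a fixed home node that manages its sharing and coherence. */
    static inline unsigned home_node(uint64_t phys_addr)
    {
        return (unsigned)((phys_addr / LINE_SIZE_BYTES) % NUM_NODES);
    }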
The home node maintains a directory, which identifies the nodes that possess a copy of the memory line. When a node requires a copy of the memory line, it requests the memory line from the home node. The home node supplies the data from its memory if its memory has the latest data. If another node has the latest copy of the data, the home node directs this node to forward the data to the requesting node. The home node employs a coherence protocol to ensure that when a node writes a new value to the memory line, all other nodes see this latest value. Coherence controllers implement this coherence functionality.
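A minimal sketch of the per-line state such a directory might keep is shown below, assuming a full bit-vector of sharers and a single owner for modified lines. All names and field widths are assumptions, not the patent's implementation.

    #include <stdint.h>

    typedef enum { DIR_INVALID, DIR_SHARED, DIR_MODIFIED } dir_state_t;

    /* One directory entry per memory line, kept at the line's home node. */
    typedef struct {
        dir_state_t state;    /* coherence state of the memory line       */
        uint32_t    sharers;  /* bit i set => node i holds a copy         */
        uint8_t     owner;    /* node with the latest data when state ==
                                 DIR_MODIFIED                             */
    } dir_entry_t;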
In typical multi-processor systems, exchanging messages on the network and looking up tables are fairly lengthy operations. Hence, substantial time may elapse between the time access to a data block is requested and the time the data block is received from another processor's cache. This latency is especially high when the requesting processor, the memory and coherence controller managing the data block, and the processor with the modified data are in three different nodes of the system, since at least three inter-node messages are necessary. For example, this latency may be about 250 processor clock cycles. As processors continue to increase in speed relative to the network and memory, this latency will grow progressively higher. In many situations (such as when the processor wants to read the memory data block), the processor cannot perform any useful computation while it waits for the data block to arrive from the cache of the other processor. This leads to inefficient utilization of expensive processor resources and overall poor performance of the application.
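For a concrete sense of where a figure like 250 cycles comes from, one plausible breakdown of the three-message transfer is shown below. The individual costs are assumptions for illustration; only the approximate total appears in the text above.

    ~80 cycles   requester -> home (request message)
    ~10 cycles   directory lookup at the home node
    ~80 cycles   home -> owner (forwarded request)
    ~80 cycles   owner -> requester (cache-to-cache data transfer)
    -----------
    ~250 cycles  end-to-end latency seen by the requesting processor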
The long latency in accessing modified data from another processor's cache, and its negative impact on application performance, is a well-known problem. Several solutions have been proposed to alleviate this problem. The mechanisms in the prior art all follow the approach of propagating data modifications to the copies in other processors' caches so that a processor can access the latest data in its own cache.
In the typical cache-coherent multi-processor system, when a memory data block required (for reading or for writing) by a processor is not currently available in its cache, a message must be sent to the memory system requesting a copy of the data block. If the required memory data block is present in another processor's cache with a modified value, this new value must be provided to the requesting processor (this is called a cache-to-cache transfer). With typical coherence protocols, this is accomplished in the following way. When a processor A requires access to a data block, it sends a message to the memory and coherence controller managing the data block requesting a copy of the data block. The memory and coherence controller determines from a table that the data block is potentially in a modified state in another processor B's cache. The memory and coherence controller sends a message to processor B requesting that the data block be sent to processor A. Upon receiving the message, processor B sends the data block to processor A and also notifies the memory and coherence controller that it has done so.
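The exchange just described can be sketched as message handlers at the home node and at the owning processor's node. This is an illustrative skeleton only: the message types, the send and dir_lookup primitives, and the dir_entry_t type from the sketch above are assumptions, not the patent's protocol.

    typedef enum { MSG_READ_REQ, MSG_FWD_REQ, MSG_DATA, MSG_DONE } msg_type_t;

    typedef struct {
        msg_type_t type;
        unsigned   line;       /* memory line being requested   */
        unsigned   requester;  /* node A, which wants the data  */
    } msg_t;

    void send(unsigned dest_node, msg_t m);   /* network primitive (assumed) */
    dir_entry_t *dir_lookup(unsigned line);   /* home node's directory       */

    /* Home node: processor A's read request arrives. */
    void home_on_read_req(msg_t m)
    {
        dir_entry_t *e = dir_lookup(m.line);
        if (e->state == DIR_MODIFIED) {
            /* Latest copy is in processor B's cache: ask B to forward it. */
            msg_t fwd = { MSG_FWD_REQ, m.line, m.requester };
            send(e->owner, fwd);
        } else {
            /* Memory holds the latest data: reply directly from memory. */
            msg_t data = { MSG_DATA, m.line, m.requester };
            send(m.requester, data);
        }
    }

    /* Node B (the owner): forward the data to A and notify the home node. */
    void owner_on_fwd_req(msg_t m, unsigned home)
    {
        msg_t data = { MSG_DATA, m.line, m.requester };
        send(m.requester, data);
        msg_t done = { MSG_DONE, m.line, m.requester };
        send(home, done);
    }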
In other past multi-processor systems, which use write-update coherence protocols, when a processor modifies a data block in its cache, the modified data block is immediately forwarded to all processors that have a copy of the data block in their caches. Since all copies of the data block are updated on every write, a processor accessing the data block will observe the latest value of the data block in its own cache. The processor's access, hence, does not incur the latency of network messages and table lookup. Write-update protocols are not suitable, however, for several reasons. Firstly, commercial microprocessors do not support the write-update protocol (they support the write-invalidate protocol). Since the cache hierarchy in commercial processors is write-back, the caches do not propagate each write to the processor bus. Also, when a data block is to be modified, most processor bus protocols invalidate the data block in all other caches rather than updating them with the new value. Furthermore, while updates require that data be supplied to a cache that did not request it, processor bus protocols do not support any transaction that transfers data without an associated request on the bus. Secondly, write-update protocols are wasteful in bandwidth and can degrade performance. Updating all copies of a data block on each write to the data block can be wasteful because a processor receiving the updates may not use the data block at all. Also, updates of each individual write may be unnecessary in cases where a processor uses the data block only after a series of modifications to the data block have been completed. Updates also impose substantial bandwidth load on the buses, networks, and processor caches. This bandwidth load can cause increased contention and queuing delays in the system, degrading performance. Thirdly, since updates are sent only to processors that have a copy of the data block, write-update protocols do not provide any benefit when a processor's cache does not contain a copy of the data block.
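For contrast with the invalidate-based flow sketched earlier, a write-update protocol pushes every write to all cached copies. The sketch below reuses the dir_entry_t and NUM_NODES assumptions from the earlier sketches; send_update is an assumed network primitive.

    /* Write-update: on every write, push the new value to every node
       whose sharer bit is set, so all cached copies stay current. */
    void send_update(unsigned node, unsigned line, uint64_t value);  /* assumed */

    void on_local_write(dir_entry_t *e, unsigned line,
                        unsigned writer, uint64_t new_value)
    {
        for (unsigned n = 0; n < NUM_NODES; n++) {
            if (n != writer && (e->sharers & (1u << n)))
                send_update(n, line, new_value);
        }
    }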
Other past multi-processor systems use what is known as the competitive-update mechanism, which is a hybrid between write-invalidate protocols and write-update protocols. As with write-update protocols, when a data block is modified, all copies of the data block are updated. However, when a processor receiving the updates has not accessed its copy of the data block for several updates (a predetermined “competitive threshold”), its copy of the data block is invalidated. Subsequent updates to the data block will not be sent to this processor. When updates are unnecessary, this approach minimizes update bandwidth over the pure write-update protocol. However, the competitive-update approach retains the other disadvantages: it wastes network bandwidth when the updates are not used (e.g., in migratory sharing patterns).
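The competitive-update behavior can be sketched with a per-copy counter at each sharer: after a fixed number of updates arrive without an intervening local access, the copy invalidates itself. The threshold value and all names below are assumptions; the text says only that the threshold is predetermined.

    #include <stdint.h>

    #define COMPETITIVE_THRESHOLD 4   /* assumed value */

    typedef struct {
        int      valid;
        uint64_t value;
        unsigned unused_updates;  /* updates since the last local access */
    } cached_copy_t;

    /* Local read: the copy proved useful, so reset the counter. */
    uint64_t local_read(cached_copy_t *c)
    {
        c->unused_updates = 0;
        return c->value;
    }

    /* Remote update arrives: apply it, or invalidate the copy once too
       many updates have gone unused by the local processor. */
    void on_remote_update(cached_copy_t *c, uint64_t new_value)
    {
        if (++c->unused_updates > COMPETITIVE_THRESHOLD)
            c->valid = 0;   /* future updates will no longer be sent here */
        else
            c->value = new_value;
    }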
Inventors: Janakiraman Gopalakrishnan, Kumar Rajendra, Kim Matthew, Li Zhuo H.