Reexamination Certificate
2001-06-21
2004-06-22
Padmanabhan, Mano (Department: 2188)
Electrical computers and digital processing systems: memory
Storage accessing and control
Hierarchical memories
C711S141000, C711S146000, C711S119000, C711S124000
Reexamination Certificate
active
06754782
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Technical Field
The present invention relates in general to data processing systems and, in particular, to non-uniform memory access (NUMA) and other multiprocessor data processing systems having improved queuing, communication and/or storage efficiency.
2. Description of the Related Art
It is well-known in the computer arts that greater computer system performance can be achieved by harnessing the processing power of multiple individual processors in tandem. Multi-processor (MP) computer systems can be designed with a number of different topologies, of which various ones may be better suited for particular applications depending upon the performance requirements and software environment of each application. One common MP computer topology is a symmetric multi-processor (SMP) configuration in which each of multiple processors shares a common pool of resources, such as a system memory and input/output (I/O) subsystem, which are typically coupled to a shared system interconnect. Such computer systems are said to be symmetric because all processors in an SMP computer system ideally have the same access latency with respect to data stored in the shared system memory.
Although SMP computer systems permit the use of relatively simple inter-processor communication and data sharing methodologies, SMP computer systems have limited scalability. In other words, while performance of a typical SMP computer system can generally be expected to improve with scale (i.e., with the addition of more processors), inherent bus, memory, and input/output (I/O) bandwidth limitations prevent significant advantage from being obtained by scaling an SMP beyond an implementation-dependent size at which the utilization of these shared resources is optimized. Thus, the SMP topology itself suffers to a certain extent from bandwidth limitations, especially at the system memory, as the system scale increases. SMP computer systems are also not easily expandable. For example, a user typically cannot purchase an SMP computer system having two or four processors, and later, when processing demands increase, expand the system to eight or sixteen processors.
As a result, an MP computer system topology known as non-uniform memory access (NUMA) has emerged to address the limitations to the scalability and expandability of SMP computer systems. As illustrated in FIG. 1, a conventional NUMA computer system 8 includes a number of nodes 10 connected by a switch 12. Each node 10, which can be implemented as an SMP system, includes a local interconnect 11 to which a number of processing units 14 are coupled. Processing units 14 each contain a central processing unit (CPU) 16 and associated cache hierarchy 18. At the lowest level of the volatile memory hierarchy, nodes 10 further contain a system memory 22, which may be centralized within each node 10 or distributed among processing units 14 as shown. CPUs 16 access memory 22 through a memory controller 20.
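The component relationships of FIG. 1 can be made concrete with a minimal sketch; the class names, field names, and sizes below are illustrative assumptions rather than anything specified by the patent:

```python
# A minimal sketch of the FIG. 1 organization: a NUMA system is a set of nodes
# joined by a switch, each node an SMP with processing units and a share of
# system memory. All names and numeric values here are illustrative only.
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProcessingUnit:            # processing unit 14
    cpu_id: int                  # CPU 16
    cache_levels: int = 2        # cache hierarchy 18 (depth is illustrative)


@dataclass
class Node:                      # node 10
    node_id: int
    processing_units: List[ProcessingUnit] = field(default_factory=list)
    system_memory_mb: int = 4096  # system memory 22, reached via memory controller 20


@dataclass
class NumaSystem:                # NUMA computer system 8
    nodes: List[Node] = field(default_factory=list)  # nodes joined by switch 12


# Example: four nodes of four processing units each.
system = NumaSystem(
    nodes=[Node(n, [ProcessingUnit(n * 4 + c) for c in range(4)]) for n in range(4)]
)
```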
Each node 10 further includes a respective node controller 24, which maintains data coherency and facilitates the communication of requests and responses between nodes 10 via switch 12. Each node controller 24 has an associated local memory directory (LMD) 26 that identifies the data from local system memory 22 that are cached in other nodes 10, a remote memory cache (RMC) 28 that temporarily caches data retrieved from remote system memories, and a remote memory directory (RMD) 30 providing a directory of the contents of RMC 28.
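A rough software analogue of this per-node-controller bookkeeping is sketched below; the dictionary-based representation and method names are assumptions made for illustration, not the patent's hardware structures:

```python
# Sketch of the node controller state described above: the LMD tracks which
# remote nodes hold lines of local memory, the RMC holds copies of remote
# lines, and the RMD indexes the RMC's current contents and their states.
class NodeController:                          # node controller 24
    def __init__(self, node_id):
        self.node_id = node_id
        self.lmd = {}   # LMD 26: local line address -> set of remote node ids
        self.rmc = {}   # RMC 28: remote line address -> cached data
        self.rmd = {}   # RMD 30: remote line address -> coherency state, e.g. "S"

    def record_checkout(self, line_addr, remote_node_id):
        """Note that a line of local memory is now cached at a remote node."""
        self.lmd.setdefault(line_addr, set()).add(remote_node_id)

    def cache_remote_line(self, line_addr, data, state="S"):
        """Install a line fetched from a remote node's system memory."""
        self.rmc[line_addr] = data
        self.rmd[line_addr] = state
```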
The present invention recognizes that, while the conventional NUMA architecture illustrated in
FIG. 1
can provide improved scalability and expandability over conventional SMP architectures, the conventional NUMA architecture is subject to a number of drawbacks. First, communication between nodes is subject to much higher latency (e.g., five to ten times higher latency) than communication over local interconnects
11
, meaning that any reduction in inter-node communication will tend to improve performance. Consequently, it is desirable to implement a large remote memory cache
28
to limit the number of data access requests that must be communicated between nodes
10
. However, the conventional implementation of RMC
28
in static random access memory (SRAM) is expensive and limits the size of RMC
28
for practical implementations. As a result, each node is capable of caching only a limited amount of data from other nodes, thus necessitating frequent high latency inter-node data requests.
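The effect of RMC size on latency can be illustrated with simple expected-value arithmetic; the cycle counts and hit rates below are assumed for illustration, using only the document's "five to ten times higher" inter-node latency figure:

```python
# Illustrative arithmetic only (latencies and hit rates are assumed, not from
# the patent): the average latency of a reference to remote data falls as the
# RMC hit rate rises, which is why a larger remote memory cache helps.
def average_remote_access_latency(rmc_hit_rate, local_cycles=100, internode_factor=8):
    """Expected latency, in cycles, for a reference to data homed on another node."""
    internode_cycles = local_cycles * internode_factor   # "five to ten times higher"
    return rmc_hit_rate * local_cycles + (1 - rmc_hit_rate) * internode_cycles

print(average_remote_access_latency(0.5))   # small SRAM-limited RMC: 450.0 cycles
print(average_remote_access_latency(0.9))   # larger RMC:             170.0 cycles
```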
A second drawback of conventional NUMA computer systems related to inter-node communication latency is the delay in servicing requests caused by unnecessary inter-node coherency communication. For example, prior art NUMA computer systems such as that illustrated in FIG. 1 typically allow remote nodes to silently deallocate unmodified cache lines. In other words, caches in the remote nodes can deallocate shared or invalid cache lines retrieved from another node without notifying the local memory directory at the home node from which the cache line was “checked out.” Thus, the home node's local memory directory maintains only an imprecise indication of which remote nodes hold cache lines from the associated system memory. As a result, when a store request is received at the home node, the home node must broadcast a Flush (i.e., invalidate) operation to all other nodes indicated in its local memory directory as holding the target cache line, regardless of whether or not those nodes still cache a copy of the target cache line. In some operating scenarios, such unnecessary flush operations can delay the servicing of store requests, which adversely impacts system performance.
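The sketch below shows, under assumed protocol behavior rather than the patent's own, why an imprecise directory forces extra work: a remote node that silently deallocates a shared line stays listed in the LMD, so a later store flushes it anyway:

```python
# Simplified illustration of unnecessary flushes caused by silent deallocation.
# The addresses, node ids, and protocol behavior are assumptions for this sketch.
lmd = {0x1000: {1, 2, 3}}                       # home node's LMD: line recorded at nodes 1-3
still_cached = {1: True, 2: False, 3: False}    # nodes 2 and 3 silently deallocated the line

def service_store(line_addr):
    # The LMD is imprecise, so a Flush is sent to every recorded sharer and the
    # store must wait for all acknowledgements, whether useful or not.
    for node in lmd.get(line_addr, set()):
        kind = "useful" if still_cached[node] else "unnecessary"
        print(f"Flush {hex(line_addr)} -> node {node} ({kind})")

service_store(0x1000)   # only one of the three flushes is actually needed
```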
Third, conventional NUMA computer systems, such as NUMA computer system 8, tend to implement deep queues within the various node controllers, memory controllers, and cache controllers distributed throughout the system to allow for the long latencies to which inter-node communication is subject. Although the implementation of each individual queue is inexpensive, the deep queues implemented throughout conventional NUMA computer systems represent a significant component of overall system cost. The present invention therefore recognizes that it would be advantageous to reduce the pendency of operations in the queues of NUMA computer systems and otherwise improve queue utilization so that queue depth, and thus system cost, can be reduced.
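The link between operation pendency and required queue depth can be illustrated with Little's law (occupancy equals arrival rate times residence time); invoking it here is an illustrative assumption, and the rates and latencies below are not taken from the patent:

```python
# Illustrative use of Little's law: the queue depth a controller must provide
# scales with how long each operation stays queued, so halving pendency halves
# the needed depth. All numbers are assumed for the example.
def required_queue_depth(requests_per_cycle, pendency_cycles):
    return requests_per_cycle * pendency_cycles

print(required_queue_depth(0.05, 800))   # long inter-node pendency -> 40.0 entries
print(required_queue_depth(0.05, 400))   # reduced pendency         -> 20.0 entries
```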
In view of the foregoing and additional drawbacks to conventional NUMA computer systems, the present invention recognizes that it would be useful and desirable to provide a NUMA architecture having improved queuing, storage and/or communication efficiency.
SUMMARY OF THE INVENTION
The present invention overcomes the foregoing and additional shortcomings in the prior art by providing a non-uniform memory access (NUMA) computer system and associated method of operation that integrate the remote memory cache of a NUMA node into the node's local system memory.
In accordance with a preferred embodiment of the present invention, a NUMA computer system includes at least a remote node and a home node coupled to an interconnect. The remote node contains at least one processing unit coupled to a remote system memory, and the home node contains at least a home system memory. To reduce access latency for data from other nodes, a portion of the remote system memory is allocated as a remote memory cache containing data corresponding to data resident in the home system memory. In one embodiment, access bandwidth to the remote memory cache is increased by distributing the remote memory cache across multiple system memories in the remote node.
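One way to picture this allocation is sketched below; the memory sizes, reserved fraction, line size, and striping rule are all assumptions made for illustration and are not specified in the summary above:

```python
# Sketch of carving a remote memory cache out of a node's system memories and
# striping RMC lines across them to raise access bandwidth. All constants and
# the mapping rule are illustrative assumptions, not the patent's scheme.
SYSTEM_MEMORIES = 4                 # system memories in the remote node
MEMORY_SIZE = 1 << 30               # 1 GiB each (illustrative)
RMC_FRACTION = 0.25                 # fraction of each memory reserved as RMC
LINE_SIZE = 128                     # cache line size in bytes (illustrative)

def rmc_location(remote_line_addr):
    """Map a remote line to (system memory index, offset of its RMC slot)."""
    line_index = remote_line_addr // LINE_SIZE
    memory_index = line_index % SYSTEM_MEMORIES           # stripe across memories
    rmc_lines_per_memory = int(MEMORY_SIZE * RMC_FRACTION) // LINE_SIZE
    slot = (line_index // SYSTEM_MEMORIES) % rmc_lines_per_memory
    rmc_base = int(MEMORY_SIZE * (1 - RMC_FRACTION))      # RMC occupies the top of each memory
    return memory_index, rmc_base + slot * LINE_SIZE

print(rmc_location(0x4000))
```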
The above as well as additional objects, features, and advantages of the present invention will become apparent in the following detailed written description.
Arimilli Ravi Kumar
Dodson John Steven
Fields, Jr. James Stephen
Baker Paul A
Bracewell & Patterson L.L.P.
International Business Machines Corporation
Salys Casimer K.