Title: Scalable multiprocessor system and cache coherence method...
Classification: Electrical computers and digital processing systems: memory – Storage accessing and control – Hierarchical memories
Type: Reexamination Certificate
Status: active
Patent number: 06748498
Filed: 2002-01-07
Issued: 2004-06-08
Examiner: Bragdon, Reginald G. (Department: 2187)
Other classes: C711S130000, C711S148000, C711S156000, C711S165000, C707S793000, C709S213000, C709S217000
FIELD OF INVENTION
The present invention relates generally to multiprocessor computer systems, and particularly to a multiprocessor system designed to be highly scalable, using efficient cache coherence logic and methodologies that implement store-conditional memory transactions when an associated directory entry is encoded as a coarse bit vector.
BACKGROUND OF THE INVENTION
High-end microprocessor designs have become increasingly more complex during the past decade, with designers continuously pushing the limits of instruction-level parallelism and speculative out-of-order execution. While this trend has led to significant performance gains on target applications such as the SPEC benchmark, continuing along this path is becoming less viable due to substantial increases in development team sizes and design times. Such designs are especially ill suited for important commercial applications, such as on-line transaction processing (OLTP), which suffer from large memory stall times and exhibit little instruction-level parallelism. Given that commercial applications constitute by far the most important market for high-performance servers, the above trends emphasize the need to consider alternative processor designs that specifically target such workloads. Furthermore, more complex designs are yielding diminishing returns in performance even for applications such as SPEC.
Commercial workloads such as databases and Web applications have surpassed technical workloads to become the largest and fastest-growing market segment for high-performance servers. Commercial workloads, such as on-line transaction processing (OLTP), exhibit radically different computer resource usage and behavior than technical workloads. First, commercial workloads often lead to inefficient executions dominated by a large memory stall component. This behavior arises from the large instruction and data footprints and high communication miss rates that are characteristic of such workloads. Second, multiple instruction issue and out-of-order execution provide only small gains for workloads such as OLTP due to the data-dependent nature of the computation and the lack of instruction-level parallelism. Third, commercial workloads have no use for the high-performance floating-point and multimedia functionality implemented in modern microprocessors. It is therefore not uncommon for a high-end microprocessor to stall most of the time while executing commercial workloads, which leads to severe under-utilization of its parallel functional units and high-bandwidth memory system. Overall, these trends further question the wisdom of pushing for more complex processor designs with wider issue and more speculative execution, especially if the server market is the target.
Fortunately, increasing chip densities and transistor counts provide architects with several alternatives for better tackling design complexity in general, and the needs of commercial workloads in particular. For example, the Alpha 21364 aggressively exploits semiconductor technology trends by including a scaled 1 GHz 21264 core, two levels of caches, a memory controller, coherence hardware, and a network router, all on a single die. The tight coupling of these modules enables a more efficient, lower-latency memory hierarchy that can substantially improve the performance of commercial workloads. Furthermore, the reuse of an existing high-performance processor core in designs such as the Alpha 21364 effectively addresses the design complexity issues and provides better time-to-market without sacrificing server performance. Higher transistor counts can also be used to exploit the inherent and explicit thread-level (or process-level) parallelism that is abundantly available in commercial workloads to better utilize on-chip resources. Such parallelism typically arises from relatively independent transactions or queries initiated by different clients, and has traditionally been used to hide I/O latency in such workloads. Previous studies have shown that techniques such as simultaneous multithreading (SMT) can provide a substantial performance boost for database workloads. In fact, the Alpha 21464 (the successor to the Alpha 21364) combines aggressive chip-level integration with an eight-instruction-wide out-of-order processor with SMT support for four simultaneous threads.
Typical directory-based cache coherence protocols suffer from extra messages and protocol processing overheads for a number of protocol transactions. These problems are the result of various mechanisms used to resolve races and deadlocks and to handle “3-hop” transactions that involve a remote node in addition to the requester and the home node (where the directory resides). For example, negative-acknowledgment messages (NAKs) are common in several cache coherence protocols for dealing with races and resolving deadlock, which occurs when two or more processors are unable to make progress because each requires a response from one or more of the others in order to do so. The use of NAKs also leads to inelegant solutions for livelock, which occurs when two or more processors continuously change state in response to changes in one or more of the others without making progress, and for starvation, which occurs when a processor is unable to acquire the resources it needs to make progress.
Similarly, 3-hop transactions (e.g., the requester sends a request, the home forwards the request to the owner, and the owner replies to the requester) typically involve two visits to the home node (along with the corresponding extra messages to the home) in order to complete the transaction. At least one cache coherence protocol avoids the use of NAKs and services most 3-hop transactions with only a single visit to the home node. However, this cache coherence protocol places strict ordering requirements on the underlying transaction-message interconnect network, going even beyond point-to-point ordering. These strict ordering requirements are a problem because they make the design of the network more complex: it is much easier to design the routing layer if each packet can be treated independently of every other packet. Strict ordering also leads to less than optimal use of the available network bandwidth.
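For concreteness, the two message flows can be enumerated side by side in a short sketch; the message names and the exact completion step are illustrative assumptions, not the vocabulary of any particular protocol:

    #include <stdio.h>

    /* Hypothetical message traces contrasting the two 3-hop styles
     * described above; names are illustrative only. */
    static const char *classic_3hop[] = {
        "requester -> home : read-exclusive request",
        "home -> owner     : forwarded request",
        "owner -> requester: data reply",
        "owner -> home     : ownership/completion update (second home visit)",
    };
    static const char *single_visit[] = {
        "requester -> home : read-exclusive request (directory updated now)",
        "home -> owner     : forwarded request",
        "owner -> requester: data reply (transaction complete)",
    };

    int main(void)
    {
        puts("Classic 3-hop transaction (two visits to the home node):");
        for (size_t i = 0; i < sizeof classic_3hop / sizeof *classic_3hop; i++)
            printf("  %zu. %s\n", i + 1, classic_3hop[i]);
        puts("Single-visit variant:");
        for (size_t i = 0; i < sizeof single_visit / sizeof *single_visit; i++)
            printf("  %zu. %s\n", i + 1, single_visit[i]);
        return 0;
    }

The saving is the final completion message back to the home; as noted above, at least one prior protocol achieves this only by imposing strict ordering on the network.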
The system and method disclosed in the parent of the present application place no ordering requirements on the underlying transaction-message interconnect network, avoid the use of NAKs, and service most 3-hop transactions with only a single visit to the home node. The parent application does not, however, disclose a cache coherence protocol for handling store-conditional memory transactions.
However, many processor architectures, including the Alpha, support generalized atomic operations through store-conditional memory transactions. Most Alpha systems implement store-conditional memory transactions using a lock-flag and lock-address process. In such systems, if the lock-flag is still set when a store-conditional memory transaction is executed, then the locked line is either in an exclusive or a shared state. If it is in the exclusive state, then the store-conditional memory transaction immediately succeeds because the requesting node (i.e., the node initiating the store-conditional memory transaction) holds exclusive access to the line. If the line is in the shared state, then the requesting node must attempt to obtain exclusive access to the line and complete the store-conditional memory transaction by sending a store-conditional request to the home node (i.e., the node maintaining a directory entry for the line that is the subject of the store-conditional memory transaction). Since another node may get exclusive access to the line first, the store-conditional memory transaction may fail.
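As a minimal sketch (not drawn from the patent), the retry structure that store-conditional sequences impose on software can be expressed with portable C11 atomics; the function name and the use of compare-exchange in place of the Alpha ldl_l/stl_c instruction pair are illustrative assumptions:

    #include <stdatomic.h>

    /* Illustrative only: atomic_compare_exchange_weak may fail spuriously,
     * much as an Alpha store-conditional (stl_c) fails whenever the
     * lock-flag set by the load-locked (ldl_l) has been cleared, so both
     * styles require an enclosing retry loop. */
    static int atomic_increment(_Atomic int *line)
    {
        int old = atomic_load(line);   /* analogous to the load-locked step */
        while (!atomic_compare_exchange_weak(line, &old, old + 1))
            ;  /* "store-conditional" failed: old now holds the fresh value; retry */
        return old;                    /* value observed before the update */
    }

On a real Alpha the loop body would be an ldl_l/addl/stl_c/beq instruction sequence; the point here is only that a store-conditional can fail and software must be prepared to retry.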
Some computer systems use a centralized directory scheme in which the directory entry for each memory line stores information indicating exactly which nodes share copies of the memory line. For instance, the directory entry may contain a bit vector in which each node of the computer system is uniquely represented by one bit.
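Continuing in C, here is a hypothetical sketch of that exact encoding next to the coarse alternative named in the field of invention; the node count, vector width, and group size are assumed values for illustration, not figures from the patent:

    #include <stdint.h>

    #define NUM_NODES     256
    #define VECTOR_BITS   64                        /* bits available in the directory entry */
    #define NODES_PER_BIT (NUM_NODES / VECTOR_BITS) /* 4 nodes map onto each coarse bit */

    /* Exact encoding: one bit per node; usable only while the node count
     * fits within the vector (node < VECTOR_BITS). */
    static inline void set_sharer_exact(uint64_t *vec, unsigned node)
    {
        *vec |= (uint64_t)1 << node;
    }

    /* Coarse encoding: each bit covers a group of nodes, so a set bit means
     * "at least one node in this group may share the line", and an
     * invalidation must be sent to every node in every marked group. */
    static inline void set_sharer_coarse(uint64_t *vec, unsigned node)
    {
        *vec |= (uint64_t)1 << (node / NODES_PER_BIT);
    }

The coarse form trades precision for scalability: the directory entry stays a fixed size as nodes are added, at the cost of invalidations sent to nodes that never held the line. Per the field of invention above, it is this coarse-bit-vector regime that the claimed store-conditional handling targets.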
Inventors: Barroso, Luiz Andre; Gharachorloo, Kourosh; Ravishankar, Mosur K.; Scales, Daniel J.; Stets, Robert J.
Examiners: Bragdon, Reginald G.; Chace, Christian P.
Assignee: Hewlett-Packard Development Company, L.P.