Scalable multiprocessor system and cache coherence method

Electrical computers and digital processing systems: memory – Storage accessing and control – Hierarchical memories

Reexamination Certificate


Details

Type: Reexamination Certificate
Status: active
Patent number: 06751710

ABSTRACT:

The present invention relates generally to multiprocessor computer systems, and particularly to a multiprocessor system designed to be highly scalable, using efficient cache coherence logic and methodologies.
BACKGROUND OF THE INVENTION
High-end microprocessor designs have become increasingly more complex during the past decade, with designers continuously pushing the limits of instruction-level parallelism and speculative out-of-order execution. While this trend has led to significant performance gains on target applications such as the SPEC benchmark, continuing along this path is becoming less viable due to substantial increases in development team sizes and design times. Such designs are especially ill suited for important commercial applications, such as on-line transaction processing (OLTP), which suffer from large memory stall times and exhibit little instruction-level parallelism. Given that commercial applications constitute by far the most important market for high-performance servers, the above trends emphasize the need to consider alternative processor designs that specifically target such workloads. Furthermore, more complex designs are yielding diminishing returns in performance even for applications such as SPEC.
Commercial workloads such as databases and Web applications have surpassed technical workloads to become the largest and fastest-growing market segment for high-performance servers. Commercial workloads, such as on-line transaction processing (OLTP), exhibit radically different computer resource usage and behavior than technical workloads. First, commercial workloads often lead to inefficient executions dominated by a large memory stall component. This behavior arises from large instruction and data footprints and high communication miss rates that are characteristic of such workloads. Second, multiple instruction issue and out-of-order execution provide only small gains for workloads such as OLTP due to the data-dependent nature of the computation and the lack of instruction-level parallelism. Third, commercial workloads have no use for the high-performance floating-point and multimedia functionality that is implemented in modern microprocessors. Therefore, it is not uncommon for a high-end microprocessor to stall most of the time while executing commercial workloads, which leads to severe under-utilization of its parallel functional units and high-bandwidth memory system. Overall, the above trends further question the wisdom of pushing for more complex processor designs with wider issue and more speculative execution, especially if the server market is the target.
Fortunately, increasing chip densities and transistor counts provide architects with several alternatives for better tackling design complexities in general, and the needs of commercial workloads in particular. For example, the Alpha 21364 aggressively exploits semiconductor technology trends by including a scaled 1 GHz 21264 core, two levels of caches, memory controller, coherence hardware, and network router all on a single die. The tight coupling of these modules enables a more efficient and lower latency memory hierarchy that can substantially improve the performance of commercial workloads. Furthermore, the reuse of an existing high-performance processor core in designs such as the Alpha 21364 effectively addresses the design complexity issues and provides better time-to-market without sacrificing server performance. Higher transistor counts can also be used to exploit the inherent and explicit thread-level (or process-level) parallelism that is abundantly available in commercial workloads to better utilize on-chip resources. Such parallelism typically arises from relatively independent transactions or queries initiated by different clients, and has traditionally been used to hide I/O latency in such workloads. Previous studies have shown that techniques such as simultaneous multithreading (SMT) can provide a substantial performance boost for database workloads. In fact, the Alpha 21464 (the successor to the Alpha 21364) combines aggressive chip-level integration along with an eight-instruction-wide out-of-order processor with SMT support for four simultaneous threads.
Typical directory-based cache coherence protocols suffer from extra messages and protocol processing overheads for a number of protocol transactions. These problems result from the various mechanisms used to resolve races and deadlocks, and from the handling of “3-hop” transactions that involve a remote node in addition to the requester and the home node (where the directory resides). For example, negative-acknowledgment messages (NAKs) are common in several cache coherence protocols for dealing with races and resolving deadlocks; deadlock occurs when two or more processors are unable to make progress because each requires a response from one or more of the others in order to do so. The use of NAKs also leads to inelegant solutions for livelock, which occurs when two or more processors continuously change state in response to changes in one or more of the others without making progress, and for starvation, which occurs when a processor is unable to acquire the resources it needs.
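As a simple illustration of the starvation hazard (a minimal sketch of my own, not taken from the patent; the adversarial arbiter, message names, and retry scheme are assumptions), consider two requesters racing for the same memory line, where each round the home grants one request and NAKs the other:

/* Hypothetical sketch (not the patent's protocol): why NAKs invite
 * starvation. Each round both requesters ask the home for the same
 * line; the home grants one and NAKs the other, who must retry.
 * Nothing prevents the same requester from losing every round. */
#include <stdio.h>

int main(void) {
    int grants[2] = {0, 0};
    int naks[2]   = {0, 0};

    for (int round = 0; round < 6; round++) {
        int winner = 0;       /* adversarial arbiter: 0 always wins */
        int loser  = 1 - winner;
        grants[winner]++;     /* GRANT: winner's transaction completes */
        naks[loser]++;        /* NAK: loser backs off and retries */
    }
    printf("requester 0: %d grants, %d NAKs\n", grants[0], naks[0]);
    printf("requester 1: %d grants, %d NAKs (starved)\n",
           grants[1], naks[1]);
    return 0;
}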
Similarly, 3-hop transactions (e.g., the requester sends a request, the home forwards the request to the owner, and the owner replies to the requester) typically involve two visits to the home node (along with the corresponding extra messages to the home) in order to complete the transaction. At least one cache coherence protocol avoids the use of NAKs and services most 3-hop transactions with only a single visit to the home node. However, that cache coherence protocol places strict ordering requirements on the underlying transaction-message interconnect network, going beyond even point-to-point ordering. These strict ordering requirements are a problem because they make the design of the network more complex: the routing layer is much easier to design if each packet can be treated independently of any other packet. Strict ordering also leads to less than optimal use of the available network bandwidth.
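The following is a minimal sketch (my own illustration; the message names and the particular form of the second home visit are assumptions, not the patent's protocol) of the message flow in a conventional 3-hop read transaction, including the extra visit to the home node:

/* Simplified sketch of a "3-hop" read in a conventional directory
 * protocol: requester -> home -> owner -> requester, plus a second
 * visit to the home so the directory can be updated and unlocked. */
#include <stdio.h>

enum node { REQUESTER, HOME, OWNER };

static void send(enum node from, enum node to, const char *msg) {
    static const char *name[] = {"requester", "home", "owner"};
    printf("%-9s -> %-9s : %s\n", name[from], name[to], msg);
}

int main(void) {
    send(REQUESTER, HOME,  "read request");      /* hop 1 */
    send(HOME, OWNER,      "forward to owner");  /* hop 2 */
    send(OWNER, REQUESTER, "data reply");        /* hop 3 */
    /* Extra traffic in conventional protocols: a second home visit
     * to update the directory before the line is unlocked. */
    send(OWNER, HOME, "ownership/sharing update (2nd home visit)");
    return 0;
}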
The present invention also avoids the use of NAKs and services most 3-hop transactions with only a single visit to the home node. Exceptions include read transactions that require two visits to the home node because of a sharing write-back that is sent back to the home node. However, the present invention does not place ordering requirements on the underlying transaction-message interconnect network.
SUMMARY OF THE INVENTION
In summary, the present invention is a system including a plurality of processor nodes configured to execute a cache coherence protocol that avoids the use of NAKs, places no ordering requirements on the underlying transaction-message interconnect network, and services most 3-hop transactions with only a single visit to the home node. Each node has access to a memory subsystem that stores a multiplicity of memory lines of information and a directory. Additionally, each node includes a memory cache for caching a multiplicity of memory lines of information stored in a memory subsystem accessible to other nodes. Further, a protocol engine is included in each node to implement the negative-acknowledgment-free cache coherence protocol. The protocol engine itself includes a memory transaction array for storing an entry related to a memory transaction, including the transaction's state. A memory transaction concerns a memory line of information and comprises a series of protocol messages, which are routed both within a given node and to other nodes. Also included in the protocol engine is logic for processing memory transactions; this processing includes advancing a memory transaction when predefined criteria are satisfied (e.g., receipt of a protocol message) and storing the updated state of the transaction in the memory transaction array.
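To make the description above concrete, here is a hypothetical sketch of such a protocol engine (the state names, entry fields, and advancement rules are my assumptions, not the patent's actual design): a memory transaction array whose entries are advanced on receipt of protocol messages, with the updated state stored back into the array.

/* Hypothetical protocol-engine sketch: a memory transaction array
 * whose entries advance when a matching protocol message arrives. */
#include <stdint.h>
#include <stdio.h>

#define MAX_TXN 16

enum txn_state { TXN_FREE, TXN_WAIT_DATA, TXN_WAIT_ACKS, TXN_DONE };
enum msg_type  { MSG_DATA_REPLY, MSG_INVAL_ACK };

struct txn_entry {
    uint64_t       line_addr;   /* memory line this transaction concerns */
    enum txn_state state;       /* memory transaction state */
    int            acks_needed; /* outstanding invalidation acks */
};

static struct txn_entry txn_array[MAX_TXN];

/* Advance the transaction for `addr` when a protocol message arrives;
 * the updated state is stored back into the transaction array. */
static void protocol_engine_step(uint64_t addr, enum msg_type msg) {
    for (int i = 0; i < MAX_TXN; i++) {
        struct txn_entry *t = &txn_array[i];
        if (t->state == TXN_FREE || t->line_addr != addr) continue;

        if (t->state == TXN_WAIT_DATA && msg == MSG_DATA_REPLY) {
            t->state = t->acks_needed ? TXN_WAIT_ACKS : TXN_DONE;
        } else if (t->state == TXN_WAIT_ACKS && msg == MSG_INVAL_ACK) {
            if (--t->acks_needed == 0) t->state = TXN_DONE;
        }
        return;
    }
}

int main(void) {
    /* Start one transaction: a read that must also collect one ack. */
    txn_array[0] = (struct txn_entry){0x1000, TXN_WAIT_DATA, 1};

    protocol_engine_step(0x1000, MSG_DATA_REPLY);
    protocol_engine_step(0x1000, MSG_INVAL_ACK);

    printf("txn state: %s\n",
           txn_array[0].state == TXN_DONE ? "DONE" : "PENDING");
    return 0;
}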


REFERENCES:
U.S. Patent No. 6,012,127 (Jan. 1, 2000), McDonald et al.

