Electrical computers and digital processing systems: memory – Storage accessing and control – Hierarchical memories
Reexamination Certificate
1999-05-20
2002-08-06
Ellis, Kevin L. (Department: 2185)
Electrical computers and digital processing systems: memory
Storage accessing and control
Hierarchical memories
C711S124000
Reexamination Certificate
active
06430658
ABSTRACT:
BACKGROUND
1. Field of the Present Invention
The present invention generally relates to the field of microprocessor based computers and more specifically to memory subsystem micro architecture in a multi processor system.
2. History of Related Art
Typical multi processor computer systems, until recently, have been designed using a set of discrete, separately packaged microprocessors. The set of microprocessors were typically interconnected via a shared or bi-directional bus commonly referred to as a host bus or local bus. The shared host bus architecture had the advantage of freeing up more pins for other signals in pin-limited microprocessor designs. In addition, the shared bus architecture implied a single active address in any given cycle that simplified arbitration and coherency management. Unfortunately, the shared bus, multi processor architecture requires a complex protocol for requesting and granting the system bus, retrying operations, and so forth. The complexity and handshaking inherent in the bus protocols implied by shared bus systems significantly hampers the ability to pipeline processor operations that require use of the local bus (i.e., any operation that accessed memory below the L
1
cache level of the system). As fabrication technology has progressed to the point that single chip, multi processor devices have become a reality, little attention has been devoted to the possible architectural advancements afforded by the elimination of pin count considerations that constrained multi-chip designs. Accordingly, much of the potential for improved performance offered by single chip devices has gone unfulfilled.
SUMMARY OF THE INVENTION
The problems identified above are in large part addressed by a multi processor system implemented with unidirectional address and data busses between the set of processors and a memory subsystem driven by a single arbiter and a unified pipeline through which all memory subsystem operations are passed. By using a single point of arbitration, the invention greatly simplifies the micro-architecture of the memory subsystem. This simplification in architecture enables a high degree of memory subsystem operation pipe lining that can greatly improve system performance.
Broadly speaking, a first embodiment of the invention emphasizing a single point of coherency arbitration and coherency enforcement includes a memory subsystem for use with a multi processor computer system. The memory subsystem includes an operation block adapted for queuing an operation that misses in an L
1
cache of a multi processor. The multi processor is comprised of a set of processors, preferably fabricated on a single semiconductor substrate and packaged in a single device package. The memory subsystem further includes an arbiter that is configured to receive external snoop operations from a bus interface unit and a queued operation from the operation block. The arbiter is configured to select and initiate one of received operations. Coherency is maintained by forwarding the address associated with the operation selected by the arbiter to each of a plurality of coherency units. In this manner, external and internal snoop addresses are arbitrated at a single point to produce a single subsystem snoop address that is propagated to each coherency unit. Preferably, the operation block includes a load miss block suitable for queuing load type operations and a store miss block suitable for queuing store type operations. In one embodiment, the subsystem includes a unidirectional local interconnect suitable for connecting the memory subsystem and the set of processors. The coherency units preferably include the L
1
cache units of the set of processors, the operation block queues, and each stage of a memory subsystem pipeline.
The first embodiment of the invention further contemplates a method of maintaining coherency in a multi processor computer system in which an external snoop operation is received via a system bus and an internal operation is received from the operation block. An arbitration takes place between the external and internal operations. The arbitration selects and initiates one of the operations and thereby generates a single snoop address. This single snoop address is the broadcast to each of the coherency units to generate a plurality of snoop responses. Preferably the arbitration of the operations is resolved according to a fairness algorithm such as a round robin algorithm. In one embodiment, the plurality of snoop responses are forwarded to a snoop control block unit that is adapted to monitor and modify operations queued in the operation block.
A second embodiment of the invention emphasizing resources for managing queued operations to eliminate retry mechanisms contemplates a multi processor computer system including a set of processors. Each processor in the set includes an execution unit for issuing operations and a processor queue suitable for queuing previously issued and still pending operations. The multi processor further includes means for forwarding operations issued by the processor to the processor queue and to an operation block queue of a memory subsystem that is connected to the multi processor. The depth of (i.e., the number of entries in) the operation block queue matches the depth of the processor queue. The processor queue, when full, inhibits the processor from issuing additional operations. In this manner, an operation issued by the processor is guaranteed an available entry in the operation block queue of the memory subsystem thereby eliminating the need for operation retry circuitry and protocols such as handshaking. Preferably, each processor queue includes a processor load queue and a processor store queue and the operation block queue includes a load queue and a store queue. In this embodiment, the depth of each of the processor load and store queues matches the depth of the operation block load and store queues respectively. In the preferred embodiment, the operation block is comprised of a load miss block that includes the operation block load queue and a store miss block that includes the operation block store queue. Still further preferably, the operation block store queue includes a set of store queues corresponding to the set of processors and the operation block load queue includes a set of load queues corresponding to the set of processors. Each queue entry preferably includes state information indicative of the status of the corresponding entry.
The second embodiment of the invention further contemplates a method of managing operation queue resources in a multi processor computer system. The method includes queuing an operation in a processor queue and in an operation block queue of a memory subsystem and detecting when the processor queue lacks an available entry (i.e., the queue is full). In response to detecting a processor full condition, the processor is then prevented from issuing additional operations thereby assuring that issued operations are guaranteed an entry in the operation block queue. Preferably, the step of queuing includes queuing load operations and store operations separately and queuing operations from each processor separately. In one embodiment, the step of detecting the lack of an available entry includes interpreting status bits associated with each entry in the processor queue. Preferably, the status of an operation in the processor queue is the same as the status of a corresponding operation in the operation block queue.
A third embodiment of the invention emphasizing efficient critical word forwarding contemplates a multi processor computer system including a multi processor device preferably comprised of a set of processors, each including a respective L
1
cache. The multi processor is preferably fabricated as a single device. The computer system includes a memory subsystem comprised of a load miss block adapted for queuing a load operation issued by a first processor that misses in an L
1
cache of the first processor and a store miss block adapted for queuing store type operations. An
Nunez Jose Melanio
Petersen Thomas Albert
Ellis Kevin L.
England Anthony V.
Lally Joseph P.
LandOfFree
Local cache-to-cache transfers in a multiprocessor system does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Local cache-to-cache transfers in a multiprocessor system, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Local cache-to-cache transfers in a multiprocessor system will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-2972121