Electrical computers and digital processing systems: memory – Storage accessing and control – Hierarchical memories
Reexamination Certificate
1999-05-18
2002-10-29
Kim, Matthew (Department: 2186)
Electrical computers and digital processing systems: memory
Storage accessing and control
Hierarchical memories
C711S123000, C711S125000, C711S126000
Reexamination Certificate
active
06473832
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
This invention is related to the field of processors and, more particularly, to load/store units within processors.
2. Description of the Related Art
Processors are increasingly designed using techniques to increase the number of instructions executed per second. Superscalar techniques involve providing multiple execution units and attempting to execute multiple instructions in parallel. Pipelining, or superpipelining, techniques involve overlapping the execution of different instructions using pipeline stages. Each stage performs a portion of the instruction execution process (e.g. fetch, decode, execution, and result commit, among others) and passes the instruction on to the next stage. While each instruction still takes the same amount of time to execute, the overlapping of instruction execution allows the effective execution rate to be higher. Typical processors employ a combination of these and other techniques to increase the instruction execution rate.
As processors employ wider superscalar configurations and/or deeper instruction pipelines, memory latency becomes an even larger issue than it was previously. While virtually all modern processors employ one or more caches to decrease memory latency, even access to these caches is beginning to impact performance.
More particularly, as processors allow larger numbers of instructions to be in-flight within the processors, the number of load and store memory operations which are in-flight increases as well. As used herein, an instruction is “in-flight” if the instruction has been fetched into the instruction pipeline (either speculatively or non-speculatively) but has not yet completed execution by committing its results (either to architected registers or memory locations). Additionally, the term “memory operation” refers to an operation which specifies a transfer of data between a processor and memory (although the transfer may be accomplished in cache). Load memory operations specify a transfer of data from memory to the processor, and store memory operations specify a transfer of data from the processor to memory. Load memory operations may be referred to herein more succinctly as “loads”, and similarly store memory operations may be referred to as “stores”. Memory operations may be implicit within an instruction which directly accesses a memory operand to perform its defined function (e.g. arithmetic, logic, etc.), or may be an explicit instruction which performs the data transfer only, depending upon the instruction set employed by the processor. Generally, memory operations specify the affected memory location via an address generated from one or more operands of the memory operation. This address will be referred to herein generally as a “data address”, or as a load address (when the corresponding memory operation is a load) or a store address (when the corresponding memory operation is a store). On the other hand, addresses which locate the instructions themselves within memory are referred to as “instruction addresses”.
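For illustration only, the following minimal sketch (hypothetical names, not the patent's data structures) captures the terminology above: each memory operation is a load or a store, carries a data address generated from its operands, and is distinct from the instruction address locating the instruction itself.

#include <cstdint>

// Hypothetical record illustrating the terminology; not the patented design.
enum class MemOpType { Load, Store };

struct MemoryOperation {
    MemOpType type;         // Load: memory -> processor; Store: processor -> memory
    uint64_t  dataAddress;  // "data address" of the affected memory location
    uint64_t  instAddress;  // "instruction address" locating the instruction itself
    bool      committed;    // an operation is in-flight until its results commit
};

// The data address is generated from one or more operands, e.g. a base
// register value plus a displacement (an assumed addressing form).
uint64_t generateDataAddress(uint64_t base, int64_t displacement) {
    return base + static_cast<uint64_t>(displacement);
}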
Since memory operations are part of the instruction stream, having more instructions in-flight leads to having more memory operations in-flight. Unfortunately, adding additional ports to the data cache to allow more operations to occur in parallel is generally not feasible beyond a few ports (e.g. 2) due to increases in both cache access time and area occupied by the data cache circuitry. Accordingly, relatively larger buffers for memory operations are often employed. Scanning these buffers for memory operations to access the data cache is generally complex and, accordingly, slow. The scanning may substantially impact the load memory operation latency, even for cache hits.
Additionally, data caches are of finite size, so some loads and stores will miss. A memory operation is a “hit” in a cache if the data accessed by the memory operation is stored in the cache at the time of access, and is a “miss” if the data accessed by the memory operation is not stored in the cache at the time of access. When a load memory operation misses a data cache, the data is typically loaded into the cache. Store memory operations which miss the data cache may or may not cause the data to be loaded into the cache. Data is stored in caches in units referred to as “cache lines”, which are the minimum number of contiguous bytes for which storage is allocated and deallocated within the cache. Since many memory operations are being attempted, it becomes more likely that numerous cache misses will be experienced. Furthermore, in many common cases, one miss within a cache line may rapidly be followed by a large number of additional misses to that cache line. These misses may fill, or come close to filling, the buffers allocated within the processor for memory operations. An efficient scheme for buffering memory operations is therefore needed.
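As a hedged illustration (assuming, purely for example, a 64-byte cache line size, which is not specified here), the cache line containing a data address can be obtained by clearing the low-order offset bits; this also shows why several nearby misses may target the same line.

#include <cstdint>

constexpr uint64_t kLineSize = 64;  // assumed line size in bytes (example only)

// Line address = data address with the within-line offset bits cleared.
uint64_t lineAddress(uint64_t dataAddress) {
    return dataAddress & ~(kLineSize - 1);
}

// Example: data addresses 0x1008 and 0x1030 both map to line 0x1000, so a miss
// at 0x1008 may quickly be followed by further accesses to the same line.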
SUMMARY OF THE INVENTION
The problems outlined above are in large part solved by a processor having pre-cache and post-cache buffers as described herein. The pre-cache (or LS1) buffer stores memory operations which have not yet probed the data cache. The post-cache (or LS2) buffer stores the memory operations which have probed the data cache. As a memory operation probes the data cache, it is moved from the LS1 buffer to the LS2 buffer. Since misses and stores which have probed the data cache do not reside in the LS1 buffer, the scan logic for selecting memory operations from the LS1 buffer to probe the data cache may be simple and low latency, allowing for the load latency to the data cache for load hits to be relatively low. Furthermore, since the memory operations which have probed the data cache have been removed from the LS1 buffer, the simple scan logic may support high performance features such as allowing hits to proceed under misses, etc. Additionally, since the LS2 buffer receives memory operations which have probed the data cache and thus may be waiting for retirement or fill data from memory, reprobing from the LS2 buffer may be less performance critical than probing from the LS1 buffer. Accordingly, the LS2 buffer may be made deeper than the LS1 buffer to queue numerous misses and/or stores. In this fashion, it may be possible to maximize the use of external bus bandwidth to service the misses.
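A minimal behavioral sketch of the two-buffer arrangement described above follows; the class names, queue types, the abbreviated MemoryOperation record, and the trivial cache stub are assumptions for illustration, not the patented implementation.

#include <cstdint>
#include <deque>

enum class MemOpType { Load, Store };

struct MemoryOperation {
    MemOpType type;
    uint64_t  dataAddress;
    bool      hitInCache = false;
};

struct DataCache {
    // Placeholder probe: a real data cache would look up dataAddress and
    // report hit or miss; here it is stubbed out for illustration.
    bool probe(const MemoryOperation&) { return true; }
};

class LoadStoreUnit {
public:
    // Newly dispatched memory operations enter the pre-cache (LS1) buffer.
    void issue(const MemoryOperation& op) { ls1.push_back(op); }

    // One scan/probe cycle: select the oldest operation in LS1, probe the data
    // cache with it, and move it to LS2. Because LS1 never holds operations
    // that have already probed, this selection logic can stay simple and fast.
    void probeCycle(DataCache& cache) {
        if (ls1.empty()) return;
        MemoryOperation op = ls1.front();
        ls1.pop_front();
        op.hitInCache = cache.probe(op);
        ls2.push_back(op);  // hits await retirement; misses and stores await fill data or commit
    }

private:
    std::deque<MemoryOperation> ls1;  // pre-cache buffer: operations that have not yet probed
    std::deque<MemoryOperation> ls2;  // post-cache buffer: probed operations; may be made deeper
};

In this sketch, as in the summary above, the LS2 buffer can be sized deeper than LS1 because reprobes from LS2 are less latency-critical than the initial probe from LS1.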
Broadly speaking, a processor is contemplated comprising a data cache and a load/store unit coupled thereto. The load/store unit includes first logic, second logic, and a buffer. The first logic is configured to select load and store memory operations to probe the data cache. The buffer is coupled to receive the load and store memory operations, and comprises a plurality of entries. The second logic is configured to allocate entries from the plurality of entries for the load and store memory operations, responsive to the load and store memory operations probing the data cache.
A method for performing memory operations in a processor is contemplated. A memory operation is selected to probe a data cache. The data cache is probed with the memory operation. The memory operation is stored in a buffer of load and store memory operations responsive to its selection. Each of the load and store memory operations in the buffer has probed the data cache.
Additionally, a computer system is contemplated, including a processor comprising a data cache and a load/store unit similar to the above-described processor. The computer system further includes an input/output (I/O) device. The input/output (I/O) device provides communication between the computer system and another computer system to which the I/O device is coupled.
REFERENCES:
patent: 5155816 (1992-10-01), Kohn
patent: 5276828 (1994-01-01), Dion
patent: 5440752 (1995-08-01), Lentz et al.
patent: 5487156 (1996-01-01), Popescu et al.
patent: 5490259 (1996-02-01), Hiraoka et al.
patent: 5526510 (1996-06-01), Akkary et al.
patent: 5557763 (1996-09-01), Senter et al.
patent: 5625835 (1997-04-01), Ebcioglu et al.
patent: 5652859 (1997-07-01), Mulla et al.
patent: 5692152 (1997-11-01), Cohen et al.
Hughes William Alexander
Lewchuk William Kurt
Ramagopal Hebbalalu S.
Chace Christian P.
Conley Rose & Tayon PC
Kim Matthew
Merkel Lawrence J.