Read prediction algorithm to provide low latency reads with...

Electrical computers and digital processing systems: memory – Storage accessing and control – Hierarchical memories

Details

C711S119000, C711S137000, C711S213000

Reexamination Certificate

active

06801982

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to computer systems and, more specifically, to cache controllers employed by computer processors.
2. Description of the Prior Art
To improve computer speed, modern computer processing units employ several caches, such as an instruction cache and a data cache, using high speed memory devices. Caches are commonly used to store data that might be repeatedly accessed by a processor to speed up processing by avoiding the longer step of reading the data from memory. Each cache is associated with a cache controller that manages the transfer of data between the processor core and the cache memory.
A cache has many “blocks” that individually store various instructions and data values. The blocks in any cache are divided into groups of blocks called “sets” or “congruence classes.” A set is the collection of cache blocks that a given memory block can reside in. For any given memory block, there is a unique set in the cache that the block can be mapped into. The number of blocks in a set is referred to as the associativity of the cache, e.g., 2-way set associative means that for any given memory block there are two blocks in the cache that the memory block can be mapped into. However, several different blocks in main memory can be mapped to any given set. A 1-way set associative cache is direct mapped; that is, there is only one cache block that can contain a particular memory block. A cache is said to be fully associative if a memory block can occupy any cache block, i.e., there is one congruence class, and the address tag is the full address of the memory block.
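
As a concrete illustration of this mapping, the short C sketch below derives the congruence class (set index) and address tag for a given address. The block size, number of sets, and associativity are arbitrary example parameters, not values taken from this patent.

/* Hypothetical example: mapping an address to a set and tag in a
 * set-associative cache. All sizes below are illustrative. */
#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE    64u      /* bytes per cache block             */
#define NUM_SETS      1024u    /* congruence classes in the cache   */
#define ASSOCIATIVITY 2u       /* ways: blocks per congruence class */

int main(void)
{
    uint64_t addr = 0x1234ABCDULL;

    /* Low-order bits pick the byte within a block, the next bits pick
     * the set, and everything above is kept as the address tag. */
    uint64_t block_number = addr / BLOCK_SIZE;
    uint64_t set_index    = block_number % NUM_SETS;
    uint64_t tag          = block_number / NUM_SETS;

    printf("address 0x%llx -> set %llu, tag 0x%llx (one of %u ways)\n",
           (unsigned long long)addr,
           (unsigned long long)set_index,
           (unsigned long long)tag,
           ASSOCIATIVITY);
    return 0;
}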
An exemplary cache line (block) includes an address tag field, a state bit field, an inclusivity bit field, and a value field for storing the actual instruction or data. The state bit field and inclusivity bit fields are used to maintain cache coherency in a multiprocessor computer system to indicate the validity of the value stored in the cache. The address tag is a subset of the full address of the corresponding memory block. A compare match of an incoming address with one of the tags within the address tag field indicates a cache “hit.” The collection of all of the address tags in a cache (and sometimes the state bit and inclusivity bit fields) is referred to as a directory, and the collection of all of the value fields is the cache entry array.
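
The C sketch below shows one plausible in-memory layout for such a cache line and the tag compare that signals a hit. The field names, widths, and the single “invalid” state encoding are assumptions made for illustration.

/* Hypothetical cache-line layout and hit check; field sizes and the
 * state encoding are illustrative, not taken from the patent. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define STATE_INVALID 0u

struct cache_line {
    uint64_t tag;        /* address tag: subset of the full block address */
    uint8_t  state;      /* coherency state bits (e.g., a MESI encoding)  */
    bool     inclusive;  /* inclusivity bit for a lower cache level       */
    uint8_t  value[64];  /* the cached instruction or data block          */
};

/* A compare match of the incoming tag against a valid entry is a "hit". */
static bool is_hit(const struct cache_line *line, uint64_t incoming_tag)
{
    return line->state != STATE_INVALID && line->tag == incoming_tag;
}

int main(void)
{
    struct cache_line line = { .tag = 0x48D2, .state = 1, .inclusive = true };
    printf("matching tag:     %s\n", is_hit(&line, 0x48D2) ? "hit" : "miss");
    printf("non-matching tag: %s\n", is_hit(&line, 0x48D3) ? "hit" : "miss");
    return 0;
}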
When all of the blocks in a congruence class for a given cache are full and that cache receives a request, whether a “read” or a “write,” to a memory location that maps into the full congruence class, the cache must “evict” one of the blocks currently in the class. The cache chooses a block by one of a number of means known to those skilled in the art (least-recently used (LRU), random, pseudo-LRU, etc.) to be evicted. If the data in the chosen block is modified, that data is written to the next lowest level in the memory hierarchy which may be another cache or main memory. By the principle of inclusion, the lower level of the hierarchy will already have a block available to hold the written modified data. However, if the data in the chosen block is not modified, the block is simply abandoned and not written to the next lowest level in the hierarchy. This process of removing a block from one level of the hierarchy is known as an “eviction.” At the end of this process, the cache no longer holds a copy of the evicted block.
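
A minimal sketch of this eviction flow, assuming an LRU age counter per way and a stubbed write_back() hook standing in for the next level of the hierarchy, is shown below; a real controller could just as well use random or pseudo-LRU selection.

/* Hypothetical eviction for one congruence class: pick the LRU way,
 * write it back only if it holds modified data, then invalidate it. */
#include <stdbool.h>
#include <stdint.h>

#define WAYS 4u

struct block {
    uint64_t tag;
    bool     valid;
    bool     modified;   /* dirty: must be written to the next level */
    unsigned lru_age;    /* larger = less recently used              */
};

/* Placeholder for writing a modified block to the next lower level
 * (another cache or main memory). */
static void write_back(const struct block *b) { (void)b; }

static unsigned choose_victim(const struct block set[WAYS])
{
    unsigned victim = 0;
    for (unsigned w = 1; w < WAYS; w++)
        if (set[w].lru_age > set[victim].lru_age)
            victim = w;
    return victim;
}

static void evict(struct block set[WAYS])
{
    unsigned v = choose_victim(set);
    if (set[v].valid && set[v].modified)
        write_back(&set[v]);     /* modified data must not be lost    */
    set[v].valid = false;        /* clean blocks are simply abandoned */
}

int main(void)
{
    struct block set[WAYS] = {
        { .tag = 1, .valid = true, .modified = true,  .lru_age = 3 },
        { .tag = 2, .valid = true, .modified = false, .lru_age = 1 },
        { .tag = 3, .valid = true, .modified = false, .lru_age = 0 },
        { .tag = 4, .valid = true, .modified = false, .lru_age = 2 },
    };
    evict(set);   /* way 0 is least recently used; it is written back */
    return 0;
}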
In a system with central processing units (also referred to herein as “CPU's”) running at very high frequencies, system performance can be highly sensitive to main memory latency. One method to reduce latency is to use an L3 cache which is shared by multiple CPU's in the system. Since many of today's CPU's have fairly large L2 caches, the shared cache (L3 cache) must be very large to have a marked impact on system performance. Unfortunately, large L3 caches built from static RAM (SRAM) chips can be quite expensive. A more cost-effective approach is to use synchronous dynamic RAM (SDRAM) chips. The primary drawback of SDRAM is the longer latency and cycle time of a given memory bank, which can be ten times or so greater than those of high-speed SRAM. The cycle time problem can be alleviated by employing many banks in the L3 cache such that the probability of accessing a busy bank is low. However, the latency is still fairly high, and thus the access should start as soon as possible.
In a computer system, read requests (also referred to as “load requests”) coming from a given CPU can be satisfied (i) by another CPU if the memory value is held in one of the CPU's caches (e.g., held in a modified or exclusive coherency state using a coherency protocol), (ii) by main memory, or (iii) by a shared cache (in this example a level 3 or L3 cache). One method to reduce latency of data supplied by the L3 cache is to access L3 data speculatively. In other words, the L3 data array is accessed in parallel with the directory and before the transaction snoop responses are known from the other CPU's. This approach can have the advantage of getting the data to the requesting CPU in the minimum amount of time in a system with low system bus utilization. However, when the system is highly utilized, there can be a significant amount of L3 data bandwidth wasted on L3 misses, or hits to modified data in another CPU's L2 cache. The net effect of the increased bandwidth usage can actually be higher average latency. To avoid this problem, the L3 cache access can be delayed until after the directory lookup and snoop responses are known. However, serially accessing the directory can also add a non-trivial amount of latency to data sourced by the L3 cache.
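
One way to picture the tradeoff is the C sketch below, which starts the L3 data-array access speculatively only while a (hypothetical) measure of system bus utilization is low, and otherwise waits for the directory and snoop results. The 50% threshold and the stubbed helper functions are illustrative assumptions, not the mechanism claimed by this patent.

/* Hedged sketch: speculative versus directory-serialized L3 data access.
 * The stubs stand in for the bus monitor, the L3 data array, and the
 * directory/snoop logic of a real system. */
#include <stdbool.h>
#include <stdio.h>

#define SPECULATE_BELOW_PERCENT 50u   /* illustrative threshold only */

static unsigned bus_utilization_percent(void) { return 30; }   /* stub */

static void start_l3_data_access(unsigned long long addr)      /* stub */
{
    printf("L3 data array access started for 0x%llx\n", addr);
}

static bool directory_hit_and_snoop_clean(unsigned long long addr)  /* stub */
{
    (void)addr;
    return true;
}

static void handle_read(unsigned long long addr)
{
    if (bus_utilization_percent() < SPECULATE_BELOW_PERCENT) {
        /* Lightly loaded bus: access the data array in parallel with the
         * directory lookup; a wasted access on a miss costs little. */
        start_l3_data_access(addr);
    } else if (directory_hit_and_snoop_clean(addr)) {
        /* Heavily loaded bus: wait for the directory and snoop responses,
         * then access the data array only on a confirmed L3 hit. */
        start_l3_data_access(addr);
    }
}

int main(void)
{
    handle_read(0x1000ULL);
    return 0;
}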
In an application-specific integrated circuit (ASIC) application using external Synchronous Dynamic Random Access Memory (SDRAM), the SDRAM chips have a turnaround penalty when switching from stores to reads. SDRAM's can be operated in either page mode, with many accesses to the same page, or non-page mode where each access opens a bank, performs the memory access, and closes the bank with auto precharge. Commercial workloads have a high percentage of random accesses and as a result, page mode does not provide any performance benefit.
In non-page mode, SDRAM chips are designed for peak performance when consecutive accesses are performed to different banks. A read is performed by first opening a bank, issuing a read command, waiting the requisite number of cycles for the CAS latency, then the data is burst from the SDRAM into the memory controller. The memory controller must wait several cycles for the row to precharge (tRP) before reactivating that bank. A write is performed by opening a bank, issuing a write command, waiting the requisite number of cycles for the write CAS latency, bursting the data from the memory controller to the SDRAM's, then waiting for the write recovery (tWR) as well as the row precharge time (tRP). Due to the extra time required for write recovery as well as waiting for the write to complete, it is time consuming to turn the bus around from performing writes to reads. This is called the bus turnaround penalty. When a read is waiting for a store to complete, an SDRAM utilization conflict is said to occur.
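
The back-of-the-envelope C sketch below adds up example cycle counts for a non-page-mode read, a write, and the extra wait a read incurs when it arrives behind a write. The timing parameters (activate-to-command, CAS latency, burst length, tWR, tRP) are made-up values, not those of any particular SDRAM part.

/* Rough, illustrative timing arithmetic for non-page-mode accesses.
 * All cycle counts below are example values. */
#include <stdio.h>

#define T_RCD        3   /* activate (open bank) to read/write command */
#define CAS_LATENCY  3   /* command to first data beat                 */
#define BURST_LEN    4   /* data beats per access                      */
#define T_WR         2   /* write recovery before precharge            */
#define T_RP         3   /* row precharge before reopening the bank    */

int main(void)
{
    int read_cycles  = T_RCD + CAS_LATENCY + BURST_LEN + T_RP;
    int write_cycles = T_RCD + CAS_LATENCY + BURST_LEN + T_WR + T_RP;

    printf("a read occupies its bank for roughly %d cycles\n", read_cycles);
    printf("a write occupies its bank for roughly %d cycles\n", write_cycles);

    /* Bus turnaround penalty: a read queued behind a write must wait for
     * the write burst, write recovery (tWR), and precharge (tRP). */
    printf("extra wait when a read follows a write: roughly %d cycles\n",
           BURST_LEN + T_WR + T_RP);
    return 0;
}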
To minimize the bus turnaround penalty and improve performance, memory interfaces have used read and store queues. Using read queues, the processor may issue reads faster than data can be returned from memory, allowing the memory controller to take advantage of bank access patterns to improve memory bandwidth. Using store queues, the processor can pass the write to the memory interface and continue processing under the assumption that the memory interface will perform the store to memory at a future time. Since the processor will wait for the results of a memory read, a memory interface will prioritize reads over stores.
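
A minimal C sketch of this read-over-store prioritization follows: the controller drains the read queue first and issues stores only when no reads are pending. The queue depth and issue loop are simplifying assumptions, not the structure of any particular memory interface.

/* Illustrative read and store queues with reads prioritized over stores. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define Q_DEPTH 8

struct queue {
    uint64_t addr[Q_DEPTH];
    int head, tail, count;
};

static bool enqueue(struct queue *q, uint64_t addr)
{
    if (q->count == Q_DEPTH)
        return false;                 /* queue full: caller must stall */
    q->addr[q->tail] = addr;
    q->tail = (q->tail + 1) % Q_DEPTH;
    q->count++;
    return true;
}

static bool dequeue(struct queue *q, uint64_t *addr)
{
    if (q->count == 0)
        return false;
    *addr = q->addr[q->head];
    q->head = (q->head + 1) % Q_DEPTH;
    q->count--;
    return true;
}

int main(void)
{
    struct queue reads = {0}, stores = {0};
    enqueue(&stores, 0x100);   /* store posted earlier                   */
    enqueue(&reads,  0x200);   /* read arrives later but is issued first */

    /* Issue loop: reads win because the processor stalls waiting for
     * their data, while posted stores can be retired to memory later. */
    uint64_t addr;
    while (reads.count > 0 || stores.count > 0) {
        if (dequeue(&reads, &addr))
            printf("issue read  0x%llx\n", (unsigned long long)addr);
        else if (dequeue(&stores, &addr))
            printf("issue store 0x%llx\n", (unsigned long long)addr);
    }
    return 0;
}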
Due to the read prioritization and the limited size of the store queue, the memory controller must ensure that the store queue can accept all stores. This can be done wit
