Electrical computers and digital processing systems: memory – Storage accessing and control – Hierarchical memories
Reexamination Certificate
1999-03-31
2002-01-08
Gossage, Glenn (Department: 2187)
C711S138000, C711S139000, C711S144000, C711S146000, C711S141000, C710S022000
active
06338119
ABSTRACT:
FIELD OF THE INVENTION
The present invention relates in general to data processing systems and in particular to processing systems which pre-fetch data from a main memory and one or more cache memories. More particularly, the present invention relates to improving performance of direct memory access and cache memory.
DESCRIPTION OF THE PRIOR ART
In modern microprocessor systems, processor cycle time continues to decrease as technology continues to improve. Also, design techniques of speculative execution, deeper pipelines, more execution elements and the like, continue to improve the performance of processing systems. The improved performance puts a heavier burden on the system's memory interface since the processor demands data and instructions more rapidly from memory. To increase the performance of processing systems, cache memory systems are often implemented.
Processing systems employing cache memories are well known in the art. Cache memories are very high-speed memory devices that increase the speed of a data processing system by making current programs and data available to a central processing unit (“CPU”) with a minimal amount of latency. Large on-chip caches (Level 1 or L1 caches) are implemented to help reduce memory latency, and they are often augmented by larger off-chip caches (Level 2 or L2 caches). The cache serves as a storage area for cache line data. Cache memory is typically divided into “lines” with each line having an associated “tag” and attribute bits. The lines in cache memory contain copies of data from main memory. For instance, a “4K page” of data in cache may be defined as comprising 32 lines of data from memory having 128 bytes in each line.
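The page arithmetic above (32 lines of 128 bytes making one 4K page) can be illustrated with a minimal sketch. The function name `split_address` and the flat byte-address model are assumptions for illustration and do not come from the patent text.

```python
LINE_SIZE = 128                          # bytes per cache line (from the text)
LINES_PER_PAGE = 32                      # cache lines per 4K page (from the text)
PAGE_SIZE = LINE_SIZE * LINES_PER_PAGE   # 4096 bytes


def split_address(addr):
    """Return (page number, line index within the page, byte offset within the line).

    Hypothetical helper: assumes a flat byte address and page-aligned buffers.
    """
    page, in_page = divmod(addr, PAGE_SIZE)
    line, offset = divmod(in_page, LINE_SIZE)
    return page, line, offset


example = split_address(0x1234)  # byte 4660 lies in page 1, line 4, offset 52
```

Under these assumptions, any two addresses that share a page number belong to the same 4K buffer, which is the unit the rest of the description is concerned with.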
The primary advantage behind cache memory systems is that by keeping the most frequently accessed instructions and data in the fast cache memory, the average memory access time of the overall processing system will approach the access time of the cache. Although cache memory is only a small fraction of the size of main memory, a large fraction of memory requests are successfully found in the fast cache memory because of the “locality of reference” property of programs. This property holds that memory references are confined to a few localized areas of memory (in this instance, the L1 and L2 caches, hereinafter referred to as the “L1/L2” cache).
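The claim that average access time approaches the cache access time can be made concrete with a simple two-level model. The symbols below are assumptions introduced for illustration: $h$ is the fraction of requests found in the cache, $t_c$ is the cache access time, and $t_m$ is the time to satisfy a request from main memory. The numbers in the worked line are illustrative, not measurements from the source.

```latex
t_{\mathrm{avg}} = h\,t_c + (1 - h)\,t_m,
\qquad
\lim_{h \to 1} t_{\mathrm{avg}} = t_c

% Illustrative values: h = 0.95, t_c = 1 cycle, t_m = 20 cycles
t_{\mathrm{avg}} = 0.95 \cdot 1 + 0.05 \cdot 20 = 1.95 \text{ cycles}
```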
The basic operation of cache memories is well-known. When the processor needs to access memory, the cache is examined. If the word addressed by the processor is found in the cache, it is read from the fast cache memory. If the word addressed by the processor is not found in the cache, the main memory is accessed to read the word. A block of words containing the word being accessed is then transferred from main memory to cache memory. In this manner, additional data is transferred to cache (pre-fetched) so that future references to memory will likely find the required words in the fast cache memory.
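The lookup-then-fill behaviour just described can be sketched as follows. This is a simplified model under stated assumptions: the class name `SimpleCache`, the block size of 4 words, and the unbounded capacity (no eviction) are all illustrative choices, not details from the source.

```python
BLOCK_WORDS = 4  # words per block; illustrative value, not from the source


class SimpleCache:
    """Hypothetical cache model with unbounded capacity and no eviction."""

    def __init__(self, main_memory):
        self.main_memory = main_memory  # list of words standing in for main memory
        self.blocks = {}                # block number -> list of cached words
        self.hits = 0
        self.misses = 0

    def read(self, addr):
        block_no, offset = divmod(addr, BLOCK_WORDS)
        if block_no in self.blocks:
            self.hits += 1              # word found in the fast cache
        else:
            self.misses += 1            # fetch the whole block containing the word
            start = block_no * BLOCK_WORDS
            self.blocks[block_no] = self.main_memory[start:start + BLOCK_WORDS]
        return self.blocks[block_no][offset]


memory = list(range(100, 116))          # 16 words of made-up data
cache = SimpleCache(memory)
values = [cache.read(a) for a in range(8)]  # sequential reads over two blocks
```

In this run, only the first word of each block misses; the neighbouring words were brought in by the block transfer, which is the implicit pre-fetch effect the paragraph describes.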
Pre-fetching techniques are often implemented to supply memory data to the on-chip L1 cache ahead of time to reduce latency. Ideally, data and instructions are pre-fetched far enough in advance so that a copy of the instructions and data is always in the L1 cache when the processor needs it. Pre-fetching of instructions and/or data is well-known in the art.
In a system which requires high Input/Output (I/O) Direct Memory Access (DMA) performance (e.g., graphics), a typical management of system memory data destined for I/O may be as follows:
1) A system processor produces data by doing a series of stores into a set of 4 Kilobyte (4K) page buffers in system memory space. This causes the data to be marked as ‘modified’ (valid in the cache, not written back to system memory) in the L1/L2 cache.
2) The processor initiates an I/O device to perform a DMA Read to these 4K pages as they are produced.
3) The I/O device does a series of DMA reads into system memory.
4) A Peripheral Component Interconnect or PCI Host bridge, which performs DMA operations on behalf of the I/O device, pre-fetches and caches data in a ‘shared’ (valid in cache, valid in system memory) state. The L1/L2 caches change each data cache line from the ‘modified’ state to the ‘shared’ state as the PCI Host Bridge reads the data (i.e., the L1/L2 caches intervene and either supply the data directly or ‘push’ it to memory where it can be read).
5) When the DMA device finishes, the 4K buffer is re-used (i.e., software has a fixed set of buffers that the data circulates through).
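The state changes in steps 1 through 4 can be traced with a small model. This is a sketch under assumptions: the `State` enum, the function names, and the per-page lists of line states are hypothetical constructs for illustration; only the ‘modified’ and ‘shared’ state names and the 32-line page come from the text.

```python
from enum import Enum


class State(Enum):
    INVALID = "invalid"
    SHARED = "shared"      # valid in cache, valid in system memory
    MODIFIED = "modified"  # valid in cache, not written back to system memory


LINES_PER_PAGE = 32


def processor_fill(cpu_page):
    """Step 1: processor stores mark every line of the buffer 'modified'."""
    for i in range(len(cpu_page)):
        cpu_page[i] = State.MODIFIED


def bridge_cacheable_read(cpu_page, bridge_page):
    """Steps 3-4: the host bridge reads each line as cacheable.

    The L1/L2 copy drops from 'modified' to 'shared', and the bridge
    holds its own copy in the 'shared' state.
    """
    for i in range(len(cpu_page)):
        if cpu_page[i] is State.MODIFIED:
            cpu_page[i] = State.SHARED
        bridge_page[i] = State.SHARED


cpu_page = [State.INVALID] * LINES_PER_PAGE
bridge_page = [State.INVALID] * LINES_PER_PAGE
processor_fill(cpu_page)
bridge_cacheable_read(cpu_page, bridge_page)
```

After the DMA read completes in this model, all 32 lines are ‘shared’ in both the L1/L2 cache and the bridge, which is the starting condition for the buffer re-use in step 5.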
In order to maintain DMA I/O performance, a PCI Host Bridge may contain its own cache which it uses to pre-fetch/cache data in the shared state. This allows DMA data to be moved close to the data consumer (i.e., an I/O device) to maximize DMA Read performance. When the PCI Host Bridge issues a cacheable read on the system bus, the L1/L2 cache goes from the ‘modified’ to the ‘shared’ state. This state change produces a performance penalty when the software wants to re-use this 4K page cache space to store the new DMA data, since every line in the L1/L2 cache has been changed to the ‘shared’ state. In order for the new stores to take place, the L1/L2 cache has to perform a system bus command for each line to indicate that the line is being taken from ‘shared’ to ‘modified.’ This must occur for each of the 32 cache lines in the 4K page even though the old data is of no use (the PCI Host Bridge needs an indication that its data is now invalid). The added memory coherency traffic, 32 system bus commands that must be issued to change the state of all these cache lines to ‘modified’ before the new stores may be executed, can degrade processor performance significantly.
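The cost of re-using a buffer whose lines are all ‘shared’ can be counted directly. The sketch below is an illustrative model, not the patent's implementation: the function name `store_to_page` and the string-based line states are assumptions, and it counts only the shared-to-modified commands described above.

```python
LINES_PER_PAGE = 32


def store_to_page(states):
    """Store to every line of a page and return the bus commands issued.

    Hypothetical model: a line already 'modified' accepts the store silently,
    while a 'shared' line first needs one system bus command to announce the
    shared -> modified transition.
    """
    bus_commands = 0
    for i, state in enumerate(states):
        if state == "shared":
            bus_commands += 1
            states[i] = "modified"
    return bus_commands


shared_cost = store_to_page(["shared"] * LINES_PER_PAGE)
modified_cost = store_to_page(["modified"] * LINES_PER_PAGE)
```

In this model a fully shared page costs one bus command per line before the stores can proceed, while a page that stayed ‘modified’ costs none.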
It has been shown that stores to a 4K page by the processor may take 4-5 times longer when the L1/L2 cache is in the ‘shared’ state as opposed to being in the ‘modified’ state. This is due to added coherency traffic needed on the system bus to change the state of each cache line to ‘modified.’
It would be desirable to provide a method and apparatus that increase the speed and efficiency of a Direct Memory Access device. It would also be desirable to provide a method and apparatus to reduce the number of system bus commands required to change the state of a page of data in the L1/L2 cache.
SUMMARY OF THE INVENTION
It is therefore one object of the present invention to provide a method and apparatus that will reduce the number of system bus commands required to change the state of a buffer in an L1/L2 cache.
It is another object of the present invention to provide a method and apparatus that will increase the speed and efficiency of Direct Memory Access (DMA) devices.
It is yet another object of the present invention to provide a method and apparatus that allow a cache to clear a memory buffer with one bus operation.
The foregoing objects are achieved as is now described. A method and system for improving direct memory access and cache performance utilize a special Input/Output or ‘I/O’ page, which is defined as having a large size (e.g., 4 Kilobytes) but with distinctive cache line characteristics. For DMA reads, the first cache line in the I/O page may be accessed, by a PCI Host Bridge, as a cacheable read and all other lines are accessed as non-cacheable (DMA Read with no intent to cache). For DMA writes, the PCI Host Bridge accesses all cache lines as cacheable. The PCI Host Bridge maintains a cache snoop granularity of the I/O page size for data, which means that if the Host Bridge detects a store (invalidate) type system bus operation on any cache line within an I/O page, cached data within that page is invalidated (L1/L2 caches continue to treat all cache lines in this page as cacheable). By defining the first line as cacheable, only one cache line need be invalidated on the system bus by the L1/L2 cache in order to cause invalidation or “killing” of the whole page of data in the PCI Host Bridge. All stores to the other cach
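The I/O page mechanism summarized above can be sketched as a small model. Everything named here is hypothetical (`HostBridge`, `dma_read_page`, `snoop_invalidate`, `reuse_page`), and the model assumes that lines read as non-cacheable remain ‘modified’ in the L1/L2 cache; the summary implies this but the available text does not state it outright.

```python
LINES_PER_PAGE = 32


class HostBridge:
    """Hypothetical bridge model that snoops at I/O-page granularity."""

    def __init__(self):
        self.cached_pages = set()  # page numbers whose data the bridge holds

    def dma_read_page(self, page, cpu_states):
        # First line: cacheable read, so the L1/L2 copy drops to 'shared'.
        cpu_states[0] = "shared"
        # Remaining lines: read with no intent to cache; assumed here to
        # leave the L1/L2 copies 'modified'.
        self.cached_pages.add(page)

    def snoop_invalidate(self, page, line):
        # A store-type bus operation on any line kills the whole page.
        self.cached_pages.discard(page)


def reuse_page(page, cpu_states, bridge):
    """Store to every line of the page; return the bus commands issued."""
    bus_commands = 0
    for line, state in enumerate(cpu_states):
        if state == "shared":
            bus_commands += 1
            bridge.snoop_invalidate(page, line)
            cpu_states[line] = "modified"
    return bus_commands


bridge = HostBridge()
cpu_states = ["modified"] * LINES_PER_PAGE
bridge.dma_read_page(7, cpu_states)
commands = reuse_page(7, cpu_states, bridge)
```

Under these assumptions, re-using the buffer issues a single bus command for the first line, and that one command is enough for the bridge to discard its data for the entire page.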
Anderson Gary Dean
Arroyo Ronald Xavier
Frey Bradly George
Guthrie Guy Lynn
Bracewell & Patterson L.L.P.
Gossage Glenn
Salys Casimer K.