Memory region based data pre-fetching

Classification: Electrical computers and digital processing systems: memory – Storage accessing and control – Hierarchical memories
Other classes: C711S100000, C711S154000, C712S207000, C712S237000
Type: Reexamination Certificate
Filed: 2002-05-01
Issued: 2004-07-06
Assignee: Koninklijke Philips Electronics, N.V.
Examiner: Thai, Tuan V. (Department: 2186)
Attorney: Waxler, Aaron
Status: Active
Patent number: 6,760,818
ABSTRACT:
FIELD OF THE INVENTION
The invention relates to data pre-fetching from a computer memory, and more specifically to pre-fetching data from a computer memory in a manner that minimizes processor stall cycles.
BACKGROUND OF THE INVENTION
As microprocessor speeds increase, processor performance is more and more affected by data access operations. When a processor in operation must wait for data because of slow data retrieval, this is termed a processor stall; in quantitative terms, the delay is measured in processor stall cycles, and a larger number of processor stall cycles is indicative of a longer delay.
Early computer systems suffered from the speed limitations of magnetic storage media. As such, caching of disk drive data is well known to enhance data access performance. In a typical caching operation, data is fetched or pre-fetched from its storage location into a cache, a temporary but faster memory, for more rapid access by the processor. Thus, the speed limitations of bulk storage media are obviated if, for example, the entire stored data set is cached in RAM.
Presently, processors are so fast that processor stall cycles occur even when retrieving data from RAM. Stall cycles give data access operations additional time to complete. As would be anticipated, pre-fetching of data from RAM is now performed to reduce processor stall cycles, and different levels of cache memory supporting different memory access speeds are used for storing different pre-fetched data. When incorrect data is pre-fetched into the cache memory, a cache miss condition occurs that is resolved through processor stall cycles. Incorrect data pre-fetched into the cache memory may also result in cache pollution, i.e. the removal of useful cache data to make room for non-useful pre-fetched data. This can cause an unnecessary cache miss when the replaced data is needed again by the processor.
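To make the pollution mechanism concrete, the following is a minimal sketch in C of a toy direct-mapped cache model; the set count, block size, and function names are illustrative assumptions, not details from the patent. Installing a wrongly pre-fetched block evicts whatever valid block happens to share its set:

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_SETS   256  /* assumed number of cache sets */
    #define BLOCK_SIZE 64   /* assumed data block size in bytes */

    /* Toy direct-mapped cache: each set holds at most one block tag. */
    static uintptr_t tags[NUM_SETS];
    static bool valid[NUM_SETS];

    static unsigned  set_of(uintptr_t addr) { return (addr / BLOCK_SIZE) % NUM_SETS; }
    static uintptr_t tag_of(uintptr_t addr) { return addr / BLOCK_SIZE / NUM_SETS; }

    /* Install the block containing addr; returns true if a different
       valid block was evicted to make room. When the installed block is
       a useless pre-fetch and the evicted block was still live, this is
       exactly the cache pollution described above. */
    bool install(uintptr_t addr) {
        unsigned s = set_of(addr);
        bool evicted = valid[s] && tags[s] != tag_of(addr);
        tags[s] = tag_of(addr);
        valid[s] = true;
        return evicted;
    }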
Memory is moved in data blocks to allow for faster transfer of larger amounts of data. A data block represents a basic unit of data for transfer into or between different levels of the cache memory hierarchy; typically, a data block contains multiple data elements. By fetching a data block into a higher level of the cache memory hierarchy before the data block is actually required by the processor, the processor stall cycles due to a cache miss are avoided. Preferably, the highest level of the cache memory hierarchy is such that a data block pre-fetched into it is retrieved by the processor without any stall penalty; this yields peak processor performance. Of course, data blocks that are to be retrieved and are not yet present in the highest level of the cache memory hierarchy either are pre-fetched before they are needed or reduce overall processor performance.
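For illustration, block granularity means that fetching any single data element transfers the whole aligned block around it. A minimal sketch in C, assuming a 64-byte block size (a common choice; the text does not fix one):

    #include <stdint.h>

    #define BLOCK_SIZE 64  /* assumed data block size in bytes */

    /* Base address of the data block containing addr; a fetch of any
       element of the block transfers this whole aligned block. */
    static uintptr_t block_base(uintptr_t addr) {
        return addr & ~(uintptr_t)(BLOCK_SIZE - 1);
    }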
A goal of pre-fetching in a processor-based system is thus to reduce the processing-time penalty incurred by processor cache misses. This goal has been addressed in the prior art. For example, U.S. Pat. No. 6,272,516 discloses a method in which the use of multiple processors reduces cache misses. U.S. Pat. No. 5,761,506, entitled "Method and apparatus for handling cache misses in a computer system", also discloses a manner in which cache misses are reduced.
In the paper entitled "Improving Processor Performance by Dynamically Pre-Processing the Instruction Stream", Dundas J. D., The University of Michigan, 1997, multiple dynamic pre-fetching techniques are disclosed, as well as methods for their use. State-of-the-art pre-fetching techniques usually rely on a certain regularity in the references to data stored in RAM made by the instructions executed by the processor. For example, successive executions of a memory reference instruction, such as a processor load instruction, may refer to memory addresses separated by a constant value, known as the stride. This stride is used to direct a pre-fetch of the data block containing an anticipated future referenced memory address. Pre-fetching thus exploits the spatial correlation between memory references to improve processor performance; in some cases, spatial locality of the data blocks within cache memory likewise improves performance. Prior art U.S. Pat. No. 6,079,006, entitled "Stride-based data address prediction structure", discloses a data prediction structure that stores base addresses and stride values in a prediction array.
Pre-fetching may be directed by software, by means of programming or compiler-inserted pre-fetch instructions, or may be directed by hardware. In the case of hardware-directed pre-fetching, the hardware tries to detect regularity in memory references and, without explicit pre-fetch instructions in the program stream, automatically generates pre-fetches of data blocks. Combined hardware/software techniques are also known in the prior art. Although prior art pre-fetching techniques are intended to improve processor performance, they have some downsides.
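As a concrete illustration of software-directed pre-fetching, the sketch below uses the __builtin_prefetch intrinsic provided by GCC and Clang to insert pre-fetch instructions a fixed distance ahead of a sequential scan; the distance of 16 elements is an illustrative tuning parameter, not a value taken from the patent:

    #include <stddef.h>

    #define PREFETCH_AHEAD 16  /* illustrative pre-fetch distance in elements */

    /* Sum an array while hinting the hardware to load data a fixed
       distance ahead of the element currently being read. */
    double sum(const double *data, size_t n) {
        double total = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + PREFETCH_AHEAD < n) {
                /* GCC/Clang builtin: pre-fetch for reading (0) with
                   high temporal locality (3). */
                __builtin_prefetch(&data[i + PREFETCH_AHEAD], 0, 3);
            }
            total += data[i];
        }
        return total;
    }

A poorly chosen distance reproduces the downsides discussed below: too small and the data still arrives late; too large and useful cache contents are displaced early.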
For example, successive references to memory addresses A, A+200, A+400, and A+600 may direct a prior art pre-fetch mechanism, assuming a stride of 200, to pre-fetch the data block containing address A+800 when that data block is not yet present in the higher level of the cache memory hierarchy and has not yet been requested.
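The stride detection that drives such a pre-fetch can be sketched as a small prediction table in C, in the spirit of the stride-based prediction array cited above; the table size, hashing by the load instruction's address, and the confirmation rule are illustrative assumptions:

    #include <stdint.h>

    #define TABLE_SIZE 256  /* illustrative predictor table size */

    /* One entry per (hashed) load instruction address: the last
       referenced data address plus the stride observed between
       successive references. */
    typedef struct {
        uintptr_t last_addr;
        intptr_t  stride;
        int       confirmed;  /* same stride seen twice in a row */
    } stride_entry;

    static stride_entry table[TABLE_SIZE];

    /* Record a memory reference made by the load at pc; returns the
       predicted next address, or 0 when there is no confident
       prediction yet. */
    uintptr_t observe(uintptr_t pc, uintptr_t addr) {
        stride_entry *e = &table[pc % TABLE_SIZE];
        intptr_t s = (intptr_t)(addr - e->last_addr);
        e->confirmed = (e->last_addr != 0 && s == e->stride);
        e->stride = s;
        e->last_addr = addr;
        return e->confirmed ? addr + e->stride : 0;
    }

Fed the references A, A+200, A+400, and A+600 in turn, observe() confirms the stride of 200 on the third reference and, on the fourth, returns A+800, which would trigger the pre-fetch of the block containing that address.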
The process of pre-fetching data blocks uses the bus that provides communication between the memory, in the form of RAM, and the cache memory; as a result, pre-fetching data blocks from memory increases bus utilization and decreases available bus bandwidth. Pre-fetching may also bring in data blocks that will never be used by the processor, adding unnecessary bus load, since another fetch may then be necessary for the processor to obtain the data it actually requires. Fetching a data block into a given level of the cache memory hierarchy requires replacing an existing cached data block, and this replacement may itself cause extra bus utilization. Often the cache data blocks are reorganized so that the block being replaced is moved to a lower level of the cache memory hierarchy; the moved data block is then no longer available at the highest level of the cache memory hierarchy for future reference, which may result in further cache misses.
On the other hand, pre-fetching extra data blocks in anticipation of their use by the processor may also result in bursty bus utilization, where the pre-fetches are not spread in time but follow each other rapidly in succession. This problem is most apparent when a series of pre-fetches is initiated to fetch multiple data blocks that hold, for example, data relating to a two-dimensional sub-structure of a larger two-dimensional structure, as in a cut-and-paste operation where a sub-graphic image is fetched from a larger graphic image laid out in memory in row-order format. Bursty bus utilization may cause temporary starvation of other processor components that require the shared bus resource, which may result in other types of processor stall cycles, thus degrading processor performance. Software-directed pre-fetching typically requires insertion of pre-fetch instructions into the program stream being executed by the processor, thereby decreasing processor instruction bandwidth. Hardware-directed pre-fetching usually requires a non-negligible amount of chip area to detect regularity in memory references; in the prior art, hardware-based techniques using memories of several kilobytes to monitor memory references are not unknown. Such hardware techniques are employed so that pre-fetching of data blocks is initiated early enough for the pre-fetch to complete by the time the pre-fetched data is actually required by the processor; otherwise the processor still stalls while the pre-fetch completes.
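The burstiness described above can be made concrete with a sketch in C of pre-fetching a rectangular sub-image out of a row-major image: one pre-fetch per data block of every sub-image row is issued in a tight loop, with nothing spreading the requests in time. The 64-byte block size and the __builtin_prefetch intrinsic (GCC/Clang) are assumptions of the sketch:

    #include <stddef.h>

    #define BLOCK_SIZE 64  /* assumed data block size in bytes */

    /* Pre-fetch an h-row by w-byte sub-image whose top-left corner is
       at (top, left) in a row-major image with the given pitch (bytes
       per full image row). The pre-fetches follow each other rapidly
       in succession, producing the bursty bus utilization described
       above. */
    void prefetch_subimage(const unsigned char *img, size_t pitch,
                           size_t top, size_t left,
                           size_t h, size_t w) {
        for (size_t row = 0; row < h; row++) {
            const unsigned char *p = img + (top + row) * pitch + left;
            for (size_t off = 0; off < w; off += BLOCK_SIZE) {
                __builtin_prefetch(p + off, 0, 1);  /* read, low temporal locality */
            }
        }
    }

Spreading such requests out in time would avoid starving other components that share the bus.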