Electrical computers and digital processing systems: processing – Processing control – Branching
Reexamination Certificate
1995-09-01
2001-08-07
Maung, Zarni (Department: 2154)
Electrical computers and digital processing systems: processing
Processing control
Branching
C712S225000, C712S226000
Reexamination Certificate
active
06272622
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates to a method of and a circuit for prefetching instructions/data using a non-referenced prefetch (NRP) cache, capable of not only reducing the number of cache misses and decreasing the access latency to the lower memory, but also reducing the memory traffic by storing blocks prefetched from the lower memory, but not referenced by the central processing unit (CPU) in an NRP cache.
2. Description of the Prior Art
Although CPUs are progressing very rapidly in performance, the progression of memories in performance is not as fast as that of such CPUs. Such a performance gap between the CPU and the memory will become more severe as time goes by. Efficiently constructing memory hierarchy is highlighted as an important factor that affects the overall performance of the computer system.
In order to obtain an efficient memory system, most modern computer systems use a cache memory. Basically, such a cache memory is adapted to utilize the locality of reference exhibited when the computer executes a program. During the execution of a program in a computer, the central processing unit (CPU) of the computer references the lower memory. In a certain limited time period, such references are concentrated in some portion of the lower memory. If the CPU references an item, there is a tendency for the items having adjacent addresses to also he referenced. This is referred to as “spatial locality”. In case of general programs. several loops are executed which require most of the entire program's execution time. In one loop, the same instructions are repeatedly executed. In particular, it is often the case where the currently referenced instructions will be referenced again shortly thereafter. This is referred to as “temporal locality”. The principle of the cache memory is to utilize such reference localities, namely, to store the frequently used blocks of the lower memory in a memory with a very high reference speed interposed between the lower memory and the CPU. Since most of the application programs have both the above-mentioned features, the cache memory can process 90% or more of the entire references to the lower memory by the CPU even when it has a small memory capacity of several kilobytes.
In the use of the cache memory, if an instruction block which the CPU is to reference exists within the cache memory, the CPU readily refers to that block. This status is called “cache hit”. On the other hand, the status that the cache memory has no instruction block to be referenced by the CPU is called “cache miss”. A hit ratio is used as a measure of the performance of the cache memory. The hit ratio is expressed by the following equation: 
Hit
⁢
 
⁢
Ratio
=
Number
⁢
 
⁢
of
⁢
 
⁢
Cache
⁢
 
⁢
Hits
Total
⁢
 
⁢
Number
⁢
 
⁢
of
⁢
 
⁢
Memory
⁢
 
⁢
References
Meanwhile, cache misses are classified into three kinds, namely, compulsory miss, conflict miss, and capacity miss. The compulsory miss is a cache miss occurring when a block is initially referenced. The conflict miss is a cache miss occurring when a block mapped in the corresponding cache memory location, but subsequently replaced by another block is referenced. On the other hand, the capacity miss is a cache miss occurring when the working set, that is, a set of pages frequently referenced by the CPU in execution of an application program, has a larger capacity than the cache memory.
These kinds of cache misses have different ratios thereof to the entire cache misses as the capacity of the cache memory increases. Although the number of compulsory misses is approximately constant irrespective of a variation in the capacity of the cache memory, numbers of conflict misses and capacity misses tend to decrease as the capacity of the cache memory increases. Since the on-chip cache memory capacity has an increasing trend owing to the development of very large scale integration (VLSI), the portion of compulsory misses over the entire cache misses also tends to increase. Under such circumstances, methods for effectively reducing the number of compulsory misses become more important because most modern computer systems use a large capacity on-chip cache memory.
Of the methods for reducing the number of compulsory misses, the simplest one is to increase the size of a cache block. This is because more instructions are brought into the cache memory on a cache miss and further cache misses caused by sequential access of memory references can be reduced with a larger cache block. However, this approach involves an increase in the CPU cycle taken to fetch a block from the lower memory to the cache memory and an increase in the memory traffic. Moreover, cache pollution may occur due to the large cache block size because the entire block should be replaced on a cache miss even when only a portion of the block is referenced. Such cache pollution results in a degradation in performance. Because the size of a cache block should be selected based oil architectural features such as memory latency and transfer rate, it can not be simply increased only for the purpose of instruction prefetching.
In order to overcome the above-mentioned disadvantages, various prefetching techniques halve been proposed. Prefetchlinig is to fetch a memory block expected to be referenced from the lower memory to the upper memory prior to a reference to the memory block by the CPU. The sequential prefetching is the simplest form of prefetching. In sequential prefetching, the sequential next block of the block currently referenced by the CPU is prefetched. This sequential prefetching technique can provide a great increase in performance for an application program involving a memory reference with a large sequentiality. Generally, the reference to instructions has a large sequentiality as compared to the reference to data. By virtue of such a feature, the sequential prefetching provides a relatively superior performance. Also, the sequential prefetching requires less additional hardware to achieve its prefetch purpose. When the reference to instructions does not follow sequential execution flows, however, no performance improvement is expected by the sequential prefetching. In other words, the sequential prefetching has a disadvantage that the performance gain obtained by the sequential prefetching is not so high when the memory reference is executed along non-sequential execution flows, as in the case of conditional branch instructions or unconditional branch instructions.
Target prefetching is to determine a block to be prefetched by using history information about previous control flows stored in a prediction table. This target prefetching is based on the fact that branch instructions, even if they are conditional branches, tend to follow the previous control flows again in many cases. If a sequential block of a particular block was referenced in previous execution, the sequential block will be prefetched. On the other hand, if a non-sequential block of a particular block was referenced to in previous execution, the non-sequential block will be prefetched. For example, where a block B was referenced after block A in previous execution, the block B will be prefetched upon referencing the block A at later times. Target prefetching has a higher prefetch accuracy thin sequential prefetching because it utilizes the characteristic of branch instructions. However it is difficult to expect a great improvement in performance by the target prefetchinig because branch instructions do not always follow the previous control flows and especially in cases that the memory reference to branch instructions follows sequential and non-sequential control flows in an alternating manner.
A hybrid prefetching is a prefetch method to prefetch both sequential and non-sequential blocks to overcome these shortcomings. But the hybrid prefetching can used in supercomputers which have less memory bandwidth restriction. In the case of microprocessors, 
Han Tack-Don
Kim Shin-Dug
Park Gi-Ho
Hyundai Electronics Industries Co,. Ltd.
Maung Zarni
LandOfFree
Method of and circuit for instruction/data prefetching using... does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Method of and circuit for instruction/data prefetching using..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Method of and circuit for instruction/data prefetching using... will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-2498310