Electrical computers and digital processing systems: processing – Processing control – Branching
Reexamination Certificate
1999-05-28
2001-11-20
Kim, Kenneth S. (Department: 2783)
Electrical computers and digital processing systems: processing
Processing control
Branching
C711S213000, C712S225000, C717S152000
Reexamination Certificate
active
06321330
ABSTRACT:
BACKGROUND OF THE INVENTION
1. Technical Field
The present invention relates to methods for prefetching data, and in particular, to methods for performing prefetches within a loop.
2. Background Art
Currently available processors run at clock speeds that are significantly faster than the clock speeds at which their associated memories operate. It is the function of the memory system to mask this discrepancy between memory and processor speeds, and to keep the processor's execution resources supplied with data. For this reason, memory systems typically include a hierarchy of caches, e.g. L0, L1, L2 . . . , in addition to a main memory. The caches are maintained with data that the processor is likely to request by taking advantage of the spatial and temporal locality exhibited by most program code. For example, data is loaded into the cache in blocks called “cache lines” since programs tend to access data in adjacent memory locations (spatial locality). Similarly, data that has not been used recently is preferentially evicted from the cache, since recently accessed data is the most likely to be accessed again (temporal locality).
The advantages of storing data in caches arise from their relatively small sizes and their attendant greater access speeds. They are fast memory structures that can provide data to the processor quickly. The storage capacities of caches generally increase from L0 to L2, et seq., as does the time required by succeeding caches in the hierarchy to return data to the processor. A data request propagates through the cache hierarchy, beginning with the smallest, fastest structure, until the data is located or the caches are exhausted. In the latter case, the requested data is returned from main memory.
Despite advances in the design of memory systems, certain types of programming structures can still place significant strains on their ability to provide the processor with data. For example, code segments that access large amounts of data from loops can rapidly generate multiple cache misses. Each cache miss requires a long latency access to retrieve the target data from a higher level cache or main memory. These accesses can significantly reduce the computer system's performance.
Prefetching is a well-known technique for masking the latency associated with moving data from main memory to the lower level caches (those closest to the processor's execution resources). A prefetch instruction is issued well ahead of the time the targeted data is required, overlapping the access with other operations and hiding the access latency behind them. However, prefetch instructions carry their own potential performance costs. Prefetch requests add traffic to the processor-memory channel, which may increase the latency of loads. These problems are exacerbated for loops that load data from multiple arrays on successive loop iterations. Such loops can issue periodic prefetch requests to ensure that the array data is available in the low level caches when the corresponding loads are executed. As discussed below, simply issuing requests on each loop iteration generates unnecessary, i.e. redundant, memory traffic and bunches the prefetches into relatively short intervals.
A prefetch returns a line of data that includes the requested address to one or more caches. Each cache line typically includes sufficient data to provide array elements for multiple loop iterations. As a result, prefetches do not need to issue on every iteration of the loop. Further, generating too many prefetch requests in a short interval can degrade system performance. Each prefetch request consumes bandwidth in the processor-memory communication channel, increasing the latency for demand fetches and other operations that use this channel. In addition, where multiple arrays are manipulated inside a loop, prefetch operations are provided for each array. Cache misses for these prefetches tend to occur at the same time, further burdening the memory subsystem with bursts of activity. One method for dealing with some of these issues is loop unrolling.
A portion of an exemplary loop (I) is shown below. The loop loads and manipulates data from five arrays A, B, C, D, and E on each loop iteration.
(I)
Orig_Loop:
load A(I)
load B(I)
load C(I)
load D(I)
load E(I)
. . .
branch Orig_Loop
FIG. 1 represents loop (I) following its modification to incorporate prefetching. Here, it is assumed that each array element is 8 bytes and each cache line returns 64 bytes, in which case a prefetch need only be issued for an array on every eighth iteration of the loop. This is accomplished in FIG. 1 by unrolling loop (I) 8 times, and issuing a prefetch request for each array with the instruction groups for successive array elements. Unrolling the loop in this manner adjusts the amount of data that is consumed on each iteration of the loop to equal the amount of data that is provided by each prefetch, eliminating redundant prefetches. On the other hand, loop unrolling can significantly expand a program's footprint (size) in memory, and it fails to address the bursts of prefetch activity that can overwhelm the memory channel.
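By way of illustration only, the following C sketch renders the unrolled structure that FIG. 1 depicts; it is not the patent's own code. It assumes 8-byte double elements, 64-byte cache lines, GCC's __builtin_prefetch, a trip count that is a multiple of 8, and a hypothetical prefetch distance PF_DIST; the use() function stands in for the loop body elided as ". . ." in loop (I).

#include <stddef.h>

#define PF_DIST 64            /* illustrative: prefetch 64 elements ahead */

static double sink;           /* placeholder consumer for the elided body */
static void use(double a, double b, double c, double d, double e)
{
    sink += a + b + c + d + e;
}

void unrolled_loop(double *A, double *B, double *C, double *D, double *E,
                   size_t n)  /* n assumed to be a multiple of 8 */
{
    for (size_t i = 0; i < n; i += 8) {
        /* One prefetch per array per pass, each placed with the instruction
         * group for a successive element, as FIG. 1 describes. */
        __builtin_prefetch(&A[i + PF_DIST]);
        use(A[i],   B[i],   C[i],   D[i],   E[i]);
        __builtin_prefetch(&B[i + PF_DIST]);
        use(A[i+1], B[i+1], C[i+1], D[i+1], E[i+1]);
        __builtin_prefetch(&C[i + PF_DIST]);
        use(A[i+2], B[i+2], C[i+2], D[i+2], E[i+2]);
        __builtin_prefetch(&D[i + PF_DIST]);
        use(A[i+3], B[i+3], C[i+3], D[i+3], E[i+3]);
        __builtin_prefetch(&E[i + PF_DIST]);
        use(A[i+4], B[i+4], C[i+4], D[i+4], E[i+4]);
        use(A[i+5], B[i+5], C[i+5], D[i+5], E[i+5]);
        use(A[i+6], B[i+6], C[i+6], D[i+6], E[i+6]);
        use(A[i+7], B[i+7], C[i+7], D[i+7], E[i+7]);
    }
}

Each pass consumes the 8 elements that one 64-byte line supplies, so a single prefetch per array per pass suffices; the eight-fold replication of the body makes the code-size cost noted above concrete.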
An alternative approach to eliminating redundant prefetches is to predicate the prefetches, calculating the predicate values on successive iterations to gate the appropriate prefetch(es) on or off. The instructions necessary to implement the predicate calculations expand the code size and, depending on the conditions to be determined, can slow down the loop.
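On hardware with predication this gating is done by predicate registers; the C sketch below merely approximates the idea with an explicit test, and the once-per-cache-line predicate and the PF_DIST distance are illustrative assumptions, not taken from the source.

#include <stddef.h>

#define PF_DIST 64            /* illustrative prefetch distance, in elements */

void predicated_loop(double *A, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* Predicate: true once per 64-byte line (every 8th double). The
         * compare evaluated on every iteration is the per-loop overhead
         * the text describes. */
        int p = ((i & 7) == 0);
        if (p)
            __builtin_prefetch(&A[i + PF_DIST]);
        /* ... load A[i] and the rest of the loop body ... */
    }
}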
The present invention addresses these and other issues related to implementing prefetches from loops.
SUMMARY OF THE INVENTION
The present invention reduces the instruction overhead and improves scheduling for software data prefetches. Register rotation is used to distribute prefetches over selected loop iterations, reducing the number of prefetches issued in any given iteration. It is particularly useful for programs that access large amounts of data from within loops.
In accordance with the present invention, data is prefetched within a loop by a prefetch operation that is parameterized by a value in a register. Data targeted by the prefetch operation is adjusted by rotating a new value into the register.
For one embodiment of the invention, the register that parameterizes the prefetch operation is a rotating register that indicates the address to be prefetched. Rotating a new value into the register alters the prefetch target for a subsequent iteration of the loop. For another embodiment of the invention, the register is a rotating predicate register that activates or deactivates the prefetch operation according to the current value of the predicate it stores. Rotating a new value into the register activates or deactivates the prefetch operation for the next iteration of the loop.
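The rotation itself is a hardware renaming mechanism (rotating general and predicate registers of the kind found on IA-64). Purely as an illustration, and under the same 8-byte-element/64-byte-line assumptions as above, the C sketch below emulates the effect with an array of pointers cycled one slot per iteration: one prefetch slot is live each iteration, each of the five arrays is prefetched once every eight iterations, and the empty slots play the role of a rotating predicate that deactivates the prefetch.

#include <stddef.h>

#define PF_DIST 64            /* illustrative prefetch distance, in elements */

void rotating_prefetch_loop(double *A, double *B, double *C, double *D,
                            double *E, size_t n)
{
    /* Emulated rotating register file: five live prefetch targets and three
     * empty slots, so each array is prefetched once per eight iterations,
     * matching one 64-byte line per eight 8-byte elements. */
    double *slot[8] = { A, B, C, D, E, NULL, NULL, NULL };
    size_t r = 0;

    for (size_t i = 0; i < n; i++) {
        if (slot[r] != NULL)                        /* rotating "predicate" */
            __builtin_prefetch(&slot[r][i + PF_DIST]);
        r = (r + 1) & 7;                            /* rotate a new value in */
        /* ... loads of A[i] through E[i] and the loop body ... */
    }
}

Unlike the unrolled version, the loop body is not replicated and at most one prefetch issues per iteration, which is the reduction in instruction overhead and the smoothing of prefetch traffic that the summary claims.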
REFERENCES:
patent: 5357618 (1994-10-01), Mirza et al.
patent: 5704053 (1997-12-01), Santhanam
patent: 5752037 (1998-05-01), Gornish et al.
patent: 5754876 (1998-05-01), Tamaki et al.
patent: 5854934 (1998-12-01), Hsu et al.
patent: 5881263 (1999-03-01), York et al.
patent: 5889985 (1999-03-01), Babaian et al.
patent: 6148439 (2000-11-01), Nishiyama
patent: WO 98/06041 (1998-02-01), None
Doshi Gautam B.
Muthukumar Kalyan
Intel Corporation
Kim Kenneth S.
Novakoski Leo V.