Cache architecture to enable accurate cache sensitivity

Patent number: 6,243,788
Filed: 1998-06-17
Issued: 2001-06-05
Examiner: Yoo, Do Hyun (Department: 2187)
Type: Reexamination Certificate (active)
US class: Electrical computers and digital processing systems: memory – Addressing combined with specific memory configuration or... – Addressing cache memories
Cross-reference class: C709S241000
TECHNICAL FIELD
This invention relates to scheduling threads in a computer system. In particular, this invention relates to methods and systems for tracking the cache footprints of threads for the purpose of cache-sensitive scheduling.
DESCRIPTION OF THE PRIOR ART
In modern computer architectures, CPU speed is scaling up much faster than memory access speed, so it has become increasingly important to deal with the effects of memory latency. To mitigate the relatively high memory access time, computer systems, as shown in FIG. 1, interpose larger and larger caches (14a, 14b) between the microprocessor (12) and the memory (13), and often multiple levels of caches are deployed (14, 15) [1]. Nevertheless, because CPU speed increases faster than memory access speed, the cache reload time, measured in CPU cycles, grows larger whenever a cache miss occurs. Various techniques are known to hide cache misses, such as instruction dependency analysis, speculative execution, out-of-order execution and prefetching [1]. With the increasing discrepancy between cache access time and memory latency, it will become increasingly difficult to hide cache misses using these techniques. As a result, microprocessors will experience more stalls, increasing the average number of cycles required to execute an instruction (cpi). To keep a computer system's cpi low, it is therefore important to reduce the number of cache misses a microprocessor suffers.
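As a rough illustration (with hypothetical numbers, not from the patent), the standard accounting from [1] shows how much the miss penalty can inflate the cpi:

    # Hypothetical illustration: effect of cache misses on cycles per
    # instruction (cpi), following the standard accounting in [1].
    base_cpi = 1.0            # assumed ideal cpi with no cache misses
    refs_per_instr = 1.3      # assumed memory references per instruction
    miss_rate = 0.02          # assumed cache miss rate
    miss_penalty = 100        # assumed cache reload time, in CPU cycles

    effective_cpi = base_cpi + refs_per_instr * miss_rate * miss_penalty
    print(effective_cpi)      # about 3.6: misses more than triple the cpi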
Cache misses typically occur because cache resources are limited: the working set of the active threads on a processor cannot be present in its entirety in the cache. In this case, switching among the threads results in cache misses, because memory accessed by one thread evicts the cache content of other threads. One obvious way to alleviate this problem is to increase the time quantum of executing threads, thus increasing the probability of cache reuse during that longer period. However, increasing the time quantum has adverse effects on other system parameters, such as response time, and hence is not generally an option.
Today, many, or even most, modern server systems are cache-coherent shared-memory multiprocessor systems (MPs) (11), in which multiple processors (12a, 12b) are linked to one or more memory modules (13) [1]. In these systems, cache misses also occur when a thread's execution migrates from one processor to another while part of its accessed memory is still cached on the previous processor. Upon access of these memory locations, cache misses occur, resulting in the transfer of cache lines to the new processor. Schedulers on such systems can improve both throughput and responsiveness by considering not only the priority of the threads being scheduled, but also the affinity of threads to the different processors [2, 4, 5, 6]. If threads are typically scheduled to processors to which they have a high affinity, the overall number of cache misses is reduced, and hence throughput is increased. Minor delays in scheduling a thread, incurred so that it runs on a processor to which it has affinity, can actually increase the thread's responsiveness, since when the thread does get to run, the processor spends less time reestablishing its cache contents. While many multiprocessor schedulers do attempt some form of affinity-based scheduling, its effectiveness is limited, since the scheduler has no way to make accurate estimates of cache affinity.
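As a minimal sketch (all names and the weighting here are hypothetical, not from the patent), an affinity-aware scheduler might rank runnable threads by combining priority with a per-processor affinity estimate:

    # Hypothetical sketch of affinity-aware thread selection. The
    # 'affinity' callback would come from one of the estimators
    # discussed below; the weighting is an arbitrary illustration.
    def pick_next(runnable, cpu, affinity, affinity_weight=0.5):
        """Pick the runnable thread with the best combined score of
        scheduling priority and estimated cache affinity to 'cpu'."""
        def score(thread):
            return thread.priority + affinity_weight * affinity(thread, cpu)
        return max(runnable, key=score)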
The first attempt in operating systems, and still the one most widely used in commercial operating systems for multiprocessor systems, is the use of virtual time stamps. Here, upon execution of a thread T on processor P_i, a per-processor time stamp is assigned to the thread. Threads with the highest time stamp for a given processor are assigned a higher affinity value. Often very simple implementations of this concept are provided, namely a value of “1” if the thread ran last on that processor and “0” otherwise. This method does not take the cache footprint of a thread into account; it assumes, often incorrectly, that the thread most recently run on a processor has the highest affinity to that processor.
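In its simplest binary form, the heuristic amounts to the following toy sketch (hypothetical names, illustrative only):

    # Toy sketch of the binary virtual-time-stamp heuristic: affinity
    # is 1 if the thread last ran on this processor, 0 otherwise.
    last_cpu = {}  # thread id -> processor the thread last ran on

    def record_run(thread_id, cpu):
        last_cpu[thread_id] = cpu

    def vts_affinity(thread_id, cpu):
        return 1 if last_cpu.get(thread_id) == cpu else 0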
Many processors have introduced mechanisms to count the number of cache misses during a set interval, and operating systems are starting to utilize this information [3, 4, 5, 6]. In the minimum-misses strategy, the scheduler remembers the number of cache misses a thread suffered during its last run; the lower the number of cache misses for a given thread, the higher its assigned cache affinity. A more elaborate strategy is based on the cache reload transient model. The reload transient is defined as the cost of reestablishing the footprint of a thread after restarting it. A Markov chain model can be used to estimate the footprint of a thread at a given time [3, 5, 6]. In particular, the Markov chain models the probability that a cache miss during a thread's execution increases the thread's number of active cache lines. For instance, in a system with N cache lines and a running thread T currently holding M cache lines, the probability that a cache miss increases T's cache footprint (i.e., that none of T's cache lines is replaced by the miss) is (N−M)/N. The chain is then constructed by applying the same logic for more than one cache miss. Similarly, the same model can be used to estimate the reduction in a thread's cache footprint given the number of cache misses since the thread's last execution. At scheduling time it then makes sense to select the thread with the lowest reload transient, as it is expected to suffer the fewest cache misses to restore its previous state. This strategy assumes the system to be Markovian, that is, memoryless, which might neither accurately reflect a thread's behavior nor the restrictions of the cache hardware, such as cache associativity [1]. Furthermore, since the cache footprint is estimated incrementally over the lifetime of the thread, this model can drift out of sync with the actual cache state, resulting in poor scheduling decisions.
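Under the model's own assumptions the expected footprint has a simple closed form: each of T's own misses grows its footprint with probability (N−M)/N, while each miss by another thread evicts one of T's lines with probability M/N. A minimal sketch of the resulting estimates (illustrative only, not the patent's mechanism):

    # Illustrative sketch of the Markov-chain footprint estimate.
    # n: total cache lines; m0: thread's current footprint; k: misses.

    def footprint_after_running(n, m0, k):
        """Expected footprint after the thread itself takes k misses:
        the gap to a full cache shrinks by (1 - 1/n) per miss."""
        return n - (n - m0) * (1 - 1 / n) ** k

    def footprint_after_waiting(n, m0, k):
        """Expected footprint after k misses by other threads: each
        miss evicts one of the thread's lines with probability m/n."""
        return m0 * (1 - 1 / n) ** k

    # Example: of 1024 lines, a thread holding 256 that sits out 512
    # foreign misses retains about 155 lines in expectation.
    print(footprint_after_waiting(1024, 256, 512))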
The main impediment to current affinity-based scheduling schemes, as described above, is that the cache affinity function is based either on very simple heuristics, e.g. virtual time stamps, or on cache footprint estimates, e.g. stochastic models such as Markov chains.
What is required, therefore, is an operating system with improved cache-affinity-based scheduling that relies on accurate cache footprint measurements.
REFERENCES
1. J. L. Hennessy, D. A. Patterson, “Computer Architecture: A Quantitative Approach,” Morgan Kaufmann Publishers, ISBN 1-55860-329-8, 1996.
2. U. Vahalia, “UNIX Internals: The New Frontier,” Prentice Hall, ISBN 0-13-101908-2, 1996.
3. D. Thiebaut, H. Stone, “Footprints in the Cache,” ACM Transactions on Computer Systems, 5(4), November 1987, pp. 305-329.
4. M. Squillante, E. Lazowska, “Using Processor Cache Affinity in Shared-Memory Multiprocessor Scheduling,” IEEE Transactions on Parallel and Distributed Systems, 4(2), February 1993, pp. 131-143.
5. A. Tucker, “Efficient Scheduling on Multiprogrammed Shared Memory Multiprocessors,” Ph.D. Thesis, Department of Computer Science, Stanford University, CX-TN-94-4, December 1993.
6. F. Bellosa, “Locality-Information-Based Scheduling in Shared-Memory Multiprocessors,” IPPS'96 Workshop on Job Scheduling Strategies for Parallel Processing, Honolulu, HI, April 1996.
References 1 through 6 above are hereby incorporated herein by reference.
SUMMARY OF THE INVENTION
It is an object of this invention to provide improved cache-affinity-based scheduling. Accordingly, this invention provides a method and apparatus for scheduling threads in a multiprocessor system by measuring a cache footprint for each of the threads for each of the processors. Then, the affinity of each of the threads for each of the processors is determined from the measured cache footprints and taken into account when scheduling the threads.
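A conceptual sketch of this approach (names hypothetical; 'measure_footprint' stands in for whatever counters the cache architecture exposes):

    # Conceptual sketch only: scheduling from measured, rather than
    # estimated, cache footprints, per the summary above.
    def schedule(runnable, cpu, measure_footprint):
        """Run next the thread with the largest measured footprint in
        this processor's cache, i.e. the highest cache affinity."""
        return max(runnable, key=lambda t: measure_footprint(t, cpu))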
Inventors: Baransky, Yurij Andrij; Franke, Hubertus; Krieger, Orran Yaakov; Pattnaik, Pratap Chandra
Attorneys/Agents: Cameron, Douglas W.; Dougherty, Anne Vachon
Assignee: International Business Machines Corporation
Examiners: Moazzami, Nasser; Yoo, Do Hyun