Electrical computers and digital processing systems: memory – Storage accessing and control – Hierarchical memories
Reexamination Certificate
1996-12-17
2001-02-13
Thai, Tuan V. (Department: 2752)
C711S125000, C711S126000, C702S182000
active
06189072
ABSTRACT:
TECHNICAL FIELD
The present invention relates in general to data processing systems, and in particular, to performance monitoring in data processing systems.
BACKGROUND INFORMATION
In typical computer systems utilizing processors, system developers desire optimization of execution software for more effective system design. Usually, studies of a program's access patterns to memory and interaction with a system's memory hierarchy are performed to determine system efficiency. Understanding the memory hierarchy behavior aids in developing algorithms that schedule and/or partition tasks, as well as distribute and structure data for optimizing the system.
Performance monitoring is often used in optimizing the use of software in a system. A performance monitor is generally regarded as a facility incorporated into a processor to monitor selected characteristics to assist in the debugging and analyzing of systems by determining a machine's state at a particular point in time. Often, the performance monitor produces information relating to the utilization of a processor's instruction execution and storage control. For example, the performance monitor can be utilized to provide information regarding the amount of time that has passed between events in a processing system. The information produced usually guides system architects toward ways of enhancing performance of a given system or of developing improvements in the design of a new system.
Prior art approaches to performance monitoring include the use of test instruments. Unfortunately, this approach is not completely satisfactory. Test instruments can be attached to the external processor interface, but they cannot determine the nature of a processor's internal operations, nor can they distinguish between instructions executing in the processor. Test instruments designed to probe the internal components of a processor are typically considered prohibitively expensive because of the difficulty of monitoring the many busses and probe points of complex processor systems that employ pipelines, instruction prefetching, data buffering, and more than one level of memory hierarchy within the processors. A common approach for providing performance data is to change or instrument the software. This approach, however, significantly affects the path of execution and may invalidate any results collected. Consequently, software-accessible counters are incorporated into processors. Most software-accessible counters, however, are limited in the granularity of information they provide.
Further, a conventional performance monitor is usually unable to capture machine state data until an interrupt is signaled, so that results may be biased toward certain machine conditions that are present when the processor allows interrupts to be serviced. Also, interrupt handlers may cancel some instruction execution in a processing system where, typically, several instructions are in progress at one time. Further, many interdependencies exist in a processing system, so that in order to obtain any meaningful data and profile, the state of the processing system must be obtained at the same time across all system elements. Accordingly, control of the sample rate is important because this control allows the processing system to capture the appropriate state. It is also important that the effect that the previous sample has on the sample being monitored is negligible to ensure the performance monitor does not affect the performance of the processor. Accordingly, there exists a need for a system and method for effectively monitoring processing system performance that will efficiently and noninvasively identify potential areas for improvement. A more effective performance monitoring system has been disclosed in the cross-referenced applications noted above.
However, these systems are not wholly sufficient for all purposes and hence may be expanded upon in a way that assists architects and implementers in improving computer system performance through better understanding of the effect of the memory hierarchy on the performance of the processor in question.
Consider the linear performance model (or just linear model) that is commonly used to evaluate and compare the performance of central processing units (CPUs). The equation is usually stated as follows:
CPI_finite=CPI_infinite+DC_miss_ratio*DC_miss_penalty+IC_miss_ratio*IC_miss_penalty
The following serves to define the six factors in the above equation:
CPI_finite=cycles per instruction of a given implementation when executing a particular workload
CPI_infinite=the minimum cycles per instruction required on average to execute a given workload when the closest level of the memory hierarchy (typically the primary (L1) caches) always has the needed information
DC_miss_ratio=number of L1 data cache misses per instruction on average
IC_miss_ratio=number of L1 instruction cache misses per instruction on average
DC_miss_penalty=Average number of cycles per L1 data cache miss per instruction
IC_miss_penalty=Average number of cycles per L1 instruction cache miss per instruction
These six factors, specifically CPI_finite, CPI_infinite, DC_miss_ratio, IC_miss_ratio, DC_miss_penalty, and IC_miss_penalty, shall be referred to as the CPU performance signature parameters, or for brevity, simply as the parameters or factors.
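As an illustration only, the linear model above can be evaluated directly once five of the factors are known. The following is a minimal sketch; the function name and the numeric values are hypothetical and do not come from any measurement.

```python
def cpi_finite(cpi_infinite, dc_miss_ratio, dc_miss_penalty,
               ic_miss_ratio, ic_miss_penalty):
    """Evaluate the linear model: base CPI plus the data cache and
    instruction cache miss contributions."""
    return (cpi_infinite
            + dc_miss_ratio * dc_miss_penalty
            + ic_miss_ratio * ic_miss_penalty)


# Hypothetical factors, chosen only to show the arithmetic.
example_cpi = cpi_finite(
    cpi_infinite=1.0,      # cycles per instruction with ideal L1 caches
    dc_miss_ratio=0.02,    # L1 data cache misses per instruction
    dc_miss_penalty=10.0,  # cycles per L1 data cache miss
    ic_miss_ratio=0.01,    # L1 instruction cache misses per instruction
    ic_miss_penalty=10.0,  # cycles per L1 instruction cache miss
)
print(round(example_cpi, 6))
```

With these assumed values the two miss terms contribute 0.2 and 0.1 cycles per instruction on top of the base of 1.0.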
Clearly, any five of these factors will serve to define all six (i.e., if only one factor is not known, the known five will allow for the determination of the unknown sixth factor). In standard practice one desires to determine via measurement all of these factors except for CPI_infinite, which is calculated. It is also possible to describe subsequent levels of cache or memory hierarchy (L2 (secondary), L3, or memory, disk, etc.). To simplify the discussion, these will not be considered, but a straightforward modification of the equation provides for these. For example:
CPI_finite=CPI_infinite+(L1_DC_miss_ratio-L2_DC_hit_ratio)*L1_DC_miss_penalty+(L1_IC_miss_ratio-L2_IC_hit_ratio)*L1_IC_miss_penalty+L2_DC_miss_ratio*L2_DC_miss_penalty+L2_IC_miss_ratio*L2_IC_miss_penalty
In this case, there is the additional detail of the activity of the external cache (sometimes referred to as the L2 cache). For the purposes of this discussion, this additional detail will not be considered, though it is valid and meaningful to do so. In the remainder of this disclosure, the discussion will be restricted to the examination of the influence of L1 caches only, but it is understood that this discussion applies to any level of memory hierarchy using suitable extensions.
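For readers who wish to experiment with the extended equation, the following sketch transcribes it term by term as stated above. The class and function names are hypothetical, and the L2 fields default to zero so that the expression reduces to the L1-only model when no L2 activity is described.

```python
from dataclasses import dataclass


@dataclass
class TwoLevelFactors:
    """Factors of the extended linear model (names are illustrative)."""
    cpi_infinite: float
    l1_dc_miss_ratio: float
    l1_dc_miss_penalty: float
    l1_ic_miss_ratio: float
    l1_ic_miss_penalty: float
    l2_dc_hit_ratio: float = 0.0
    l2_ic_hit_ratio: float = 0.0
    l2_dc_miss_ratio: float = 0.0
    l2_dc_miss_penalty: float = 0.0
    l2_ic_miss_ratio: float = 0.0
    l2_ic_miss_penalty: float = 0.0


def cpi_finite_two_level(f: TwoLevelFactors) -> float:
    """Evaluate the extended equation exactly as written in the text."""
    return (f.cpi_infinite
            + (f.l1_dc_miss_ratio - f.l2_dc_hit_ratio) * f.l1_dc_miss_penalty
            + (f.l1_ic_miss_ratio - f.l2_ic_hit_ratio) * f.l1_ic_miss_penalty
            + f.l2_dc_miss_ratio * f.l2_dc_miss_penalty
            + f.l2_ic_miss_ratio * f.l2_ic_miss_penalty)


# Hypothetical L1-only factors: the L2 terms vanish and the result
# matches the single-level linear model.
l1_only = TwoLevelFactors(1.0, 0.02, 10.0, 0.01, 10.0)
print(round(cpi_finite_two_level(l1_only), 6))
```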
The usual approach in using the linear model is to determine the factors for a given workload and then consider hardware/software modifications to these factors to understand the effect on the CPI. In particular, CPI_infinite is an estimate of the best-case performance of the CPU with an ideal (though possibly very expensive) storage hierarchy and is an important characteristic of the CPU and workload of interest (measurement shows that the behavior of the workload and the CPU cannot be separated in any meaningful manner). For example, one supposes that a different memory subsystem design can reduce the storage access times by some amount. This change in the memory subsystem design will be reflected in the net delays for the various cache miss penalties. Thus, one can recompute CPI_finite based on the different memory system design.
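The procedure just described can be sketched as two steps: first calculate CPI_infinite from the measured factors, then recompute CPI_finite with the miss penalties of a hypothesized memory subsystem. The function names and all numbers below are assumptions made for illustration, not measured data.

```python
def solve_cpi_infinite(cpi_finite, dc_miss_ratio, dc_miss_penalty,
                       ic_miss_ratio, ic_miss_penalty):
    """Rearrange the linear model to obtain the one factor that is
    calculated rather than measured."""
    return (cpi_finite
            - dc_miss_ratio * dc_miss_penalty
            - ic_miss_ratio * ic_miss_penalty)


def recompute_cpi_finite(cpi_infinite, dc_miss_ratio, new_dc_miss_penalty,
                         ic_miss_ratio, new_ic_miss_penalty):
    """Apply the linear model with the penalties of a different memory design."""
    return (cpi_infinite
            + dc_miss_ratio * new_dc_miss_penalty
            + ic_miss_ratio * new_ic_miss_penalty)


# Step 1: hypothetical measured factors for the existing design.
base = solve_cpi_infinite(cpi_finite=1.3,
                          dc_miss_ratio=0.02, dc_miss_penalty=10.0,
                          ic_miss_ratio=0.01, ic_miss_penalty=10.0)

# Step 2: suppose a new memory subsystem cuts both penalties to 8 cycles.
what_if = recompute_cpi_finite(base, 0.02, 8.0, 0.01, 8.0)
print(round(base, 6), round(what_if, 6))
```

The sketch holds the miss ratios and CPI_infinite fixed and varies only the penalties, which is the simplification the linear model itself makes.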
The rate of progress of the workload on a system depends on the number of instructions that can be executed per second. Since the number of instructions that must be executed is essentially invariant and known, the rate at which instructions execute determines the performance of a given workload on the system of interest.
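The relationship between CPI and the rate of instruction execution can be sketched as follows. The clock frequency is an assumption introduced here for illustration (the text above does not state one), as are the function names and the instruction count.

```python
def instructions_per_second(clock_hz, cpi):
    """Instruction rate implied by a clock frequency and an average CPI."""
    return clock_hz / cpi


def execution_time_seconds(instruction_count, cpi, clock_hz):
    """Time to run a workload with a fixed, known instruction count."""
    return instruction_count * cpi / clock_hz


# Hypothetical workload: 1e9 instructions on an assumed 100 MHz clock.
CLOCK_HZ = 100e6
slower = execution_time_seconds(1e9, cpi=1.3, clock_hz=CLOCK_HZ)
faster = execution_time_seconds(1e9, cpi=1.24, clock_hz=CLOCK_HZ)
print(round(slower, 3), round(faster, 3))
```

Because the instruction count is treated as invariant, any reduction in CPI_finite translates proportionally into a shorter run time in this sketch.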
Assuming that cost of a hypothesized memory system is known, the resultant system c
Levine Frank Eliot
Moore Roy Stuart
Roth Charles Philip
Welbon Edward Hugh
England Anthony V. S.
International Business Machines - Corporation
Kordik Kelly K.
Thai Tuan V.
Winstead Sechrest & Minick P.C.