Electrical computers and digital processing systems: processing – Processing control – Branching
Reexamination Certificate
2001-10-01
2003-11-11
El-Hady, Nabil (Department: 2154)
Electrical computers and digital processing systems: processing
Processing control
Branching
C712S233000, C712S238000, C712S239000, C712S240000, C711S202000, C711S203000, C711S204000, C711S205000, C711S206000
Reexamination Certificate
active
06647491
ABSTRACT:
TECHNICAL FIELD OF THE INVENTION
This application relates in general to run-time optimizers, and in specific to a mechanism for instruction profiling and trace selection.
BACKGROUND OF THE INVENTION
A run-time optimizer is an adaptive software system that transparently optimizes applications at run-time. The optimizer rewrites the binary code of an application on-the-fly to achieve a higher execution efficiency.
FIG. 3
depicts prior art run time optimizer
300
. The control loop
310
begins execution of a block of program code, via emulation performed by the profiling emulator
320
. The profiling aspect of emulator
320
allows the control loop
310
to track the number of time the particular block of code has been executed via emulation. Note that a run time optimization system is different from a run time binary translation system, in that the latter is for architecture migration, while the former is to decrease execution time. The run time optimization system is using the emulator
320
for profiling in order to guide optimizations, i.e. the code is running on its native system. After a predetermined number of executions via emulation, the control loop
310
designates the block of code as hot code, and desirable for optimization. The control loop
310
then activates trace selector
330
to translate the block of code. The trace selector
330
forms a trace of the instructions that comprise the block of code by following the instructions in the block. When a branch instruction is encountered, the trace selector makes a prediction as to whether the branch is taken or falls through. If the selector decides the branch is mostly taken, then the trace is formed by extending the code from the branch target block. If the selector decides not to take the branch, then the branch falls through, and the trace continues within the fall through block. The trace terminates at a backward branch predicted to take or when the trace becomes sufficiently large. After the trace is formed, the newly formed trace is optimized for the current processor. The optimized code is then placed into the code cache
340
. The next time the control loop
310
encounters a condition to execute this block of code, then the control loop
310
will execute the code in the code cache
340
and not emulate the code via emulator
320
.
A problem with
FIG. 3
is that an emulator is required to perform profiling, i.e. the emulated code is used to determine which code is hot. Emulation is very slow; usually 50-200 times slower than native execution speed. Consequently, there is a large time penalty for determining which code is hot. Moreover, the quality of optimization is often determined by the quality of the selected trace. Branch mis-prediction in trace construction is costly, for example predicting a branch not to be taken means the remainder of the block code is traced and optimized, and if mis-predicted then that tracing and optimizing of the code subsequent to the branch is wasted. Branch mis-prediction can be minimized by maintaining a long history of branching outcomes, which is formed by continually emulating the code block. Thus, the prior art run time optimization system (RTOS) either incurs a time penalty from emulation to build a good history, or incurs a time penalty from branch mis-prediction.
Another problem with the prior art RTOS is that it attempts to translate any code that is deemed hot based on a small threshold. This problem is referred to as complex and less reliable. There are some traces that are difficult to translate, for example, it is difficult to translate a trace with a branch in the delay slot of another branch, but without a translation, the execution of the trace would be performed by software simulation or emulation. Since emulation is slow, all hot code is translated. The requirement of translating all hot code, including all the difficult traces, increases the translation time and complexity. With this software based approach, it is rather difficult to come up with an ideal threshold value. A higher threshold would incur too much overhead while a lower threshold would end up with traces in poor quality, i.e. too many traces, and traces with early exits. Note that emulation time overhead is such that a higher threshold would require the code to be emulated much longer before they get a chance to be translated.
Consequently, it is difficult for the prior art RTOS to have long cache lines to hold traces. Long cache lines are desirable because the longer the cache line, the higher the cache hit rate. However, this is difficult for the prior art RTS because traces are either inaccurately formed or require a large amount of overhead for profiling. Moreover, the traces that are formed are not reliable because of branch mis-predictions. For example, if the overhead for software based profiling takes 5% of execution time, then the run-time optimization must gain at least 5% of performance in order to break even.
The prior art RTOS has been described in terms of a pure software approach. However, another prior art approach of generating traces is in pure hardware form. However, this approach requires a great deal of complexity in the hardware to form the traces. Particularly, the hardware approach requires an additional cache for holding the trace, i.e. a trace cache. Refer to E. Rotenberg, S. Bennett, and J. E. Smith, “Trace Cache: A Low Latency Approach to High-Bandwidth Instruction Fetch,” Proc. Int'l Symnp. MicroArchitcture, IEEE CS Press, Los Alamitos, Calif., 1996, which is incorporated by reference.
Therefore, there is a need in the art for RTOS that does not require emulation for profiling, produces reliable traces, and uses hot code to form traces.
SUMMARY OF THE INVENTION
These and other objects, features and technical advantages are achieved by a system and method which provides fast profiling and effective trace selection with some micro-architecture support. The instruction execution frequency and branch prediction information are collected at essentially no execution time cost. Therefore, the profiling and trace selection can have a much higher threshold and will select traces with a higher quality.
The inventive mechanism partitions the work between hardware and software. The hardware automatically detects which code is executed very frequently, e.g. which code is hot code. Hardware is better suited to this task than software, because software would require more overhead in making the determination, while hardware incurs essentially zero overhead. Moreover, since hardware executes the branch instructions, the hardware also keeps the branch history information to do branch prediction. When the hardware determines that a section or block of code is hot, e.g. hot code, the hardware sends a signal to the software, which is maintained in firmware. This signal informs the software and lets the software decide which trace to select. Because the trace latching is done by the software, it has more freedom and can handle more complex cases. Furthermore, the software may add some optimizations to hot code, and has the capability to form longer traces and better traces. Software is better suited for these tasks than hardware, because the software has more flexibility, less implementation complexity and is less expensive than hardware.
Therefore, it is a technical advantage of the present invention to have a hardware processor identify frequently used code and use software embedded in firmware predict and select traces from the frequently used code.
It is another technical advantage of the present invention to use the hardware for frequently used code identification and profiling, as hardware requires less overhead in making the determinations than software.
It is a further technical advantage of the present invention to use the software for trace prediction and selection, as software has more flexibility, less complexity and is less expensive than hardware.
It is a still further technical advantage of the present invention to use the instruction cache to maintain counters
Benitez Manuel
Hsu Wei C.
El-Hady Nabil
Hewlett--Packard Development Company, L.P.
LandOfFree
Hardware/software system for profiling instructions and... does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with Hardware/software system for profiling instructions and..., we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Hardware/software system for profiling instructions and... will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-3139681