Electrical computers and digital processing systems: processing – Processing control – Branching
Reexamination Certificate
1999-02-18
2002-09-17
Maung, Zarni (Department: 2154)
Electrical computers and digital processing systems: processing
Processing control
Branching
C712S238000, C712S239000, C711S202000, C711S203000, C711S204000, C711S205000, C711S206000
Reexamination Certificate
active
06453411
ABSTRACT:
TECHNICAL FIELD OF THE INVENTION
This application relates in general to run-time optimizers, and in specific to hardware embedded rim-time optimizer.
BACKGROUND OF THE INVENTION
A run-time optimizer is an adaptive software system that transparently optimizes applications at run-time. The optimizer rewrites the binary code of an application on-the-fly to achieve a higher execution efficiency.
FIG. 4
depicts prior art run-time optimizer
30
. The control loop
31
begins execution of a block of program code via emulation performed by the profiling emulator
32
. The profiling aspect of emulator
32
allows the control loop
31
to track the number of times the particular block of code has been executed via emulation. Note that a run-time optimization system is different from a run-time binary translation system, in that the latter is for architecture migration while the former is to decrease execution time. The run-time optimization system is using the emulator
32
for profiling in order to guide optimizations, i.e. the code is running on its native system. After a predetermined number of executions via emulation, the control loop
31
designates the block of code as hot code, and desirable for optimization. The control loop
31
then activates trace selector
33
to translate the block of code. The trace selector
33
forms a trace of the instructions that comprise the block of code by following the instructions in the block. When a branch instruction is encountered, the trace selector makes a prediction as to whether the branch is taken or falls through. If the selector decides the branch is mostly taken, then the trace is formed by extending the code from the branch target block. If the selector decides not to take the branch, then the branch falls through, and the trace continues within the fall through block. The trace terminates at a backward branch predicted to be taken or when the trace becomes sufficiently large. After the trace is completed, the code is rewritten with machine dependent and machine independent optimizations. The optimized code is then placed into the code cache
34
. The next time the control loop
31
encounters a condition to execute this block of code, then the control loop
31
will execute the code in the code cache
34
and not emulate the code via emulator
32
.
As shown in
FIG. 5
, if the target of a branch which is taken to exit trace
1
, as shown by branch instruction
41
, then control is returned to the run-time system RTS
30
and to control loop
31
, which determines if the target resides in the code cache. If the target resides in code cache, then the control loop
31
modifies the target of the branch instruction
41
to be the trace
2
42
in code cache as shown by branch instruction
43
. This modification is called backpatching. Thus, if the exit of the trace is already translated, then the branch is backpatched such that a subsequent execution will directly branch to the new trace without returning to the control loop. Backpatching increases the speed of execution of the code, as returning to the RTS significantly slows down execution time.
A problem with
FIG. 4
is that an emulator is required to perform profiling, i.e. the emulated code is used to determine which code is hot. Emulation is very slow, usually 50-200 times slower than native execution speed. Consequently, there is a large time penalty for determining which code is hot. Moreover, the quality of optimization is often determined by the quality of the selected trace. Poor trace selection can be costly, for example, predicting a branch not to be taken means the remainder of the block code is traced and optimized, and if mispredicted, then that tracing and optimizing of the code subsequent to the branch is wasted. Branch misprediction can be minimized by maintaining a long history of branching outcomes, which is formed by continually emulating the code block. Thus, the prior art RTS either incurs a time penalty from emulation to build a good history or incurs a time penalty from branch misprediction.
Another problem with the prior art RTS is that it cannot backpatch an indirect branch. The RTS cannot backpatch an indirect branch because the target address is unknown. The target address is typically in a register or memory location and not written directly in code. Thus, the RTS will shift control back to the control loop
31
to determine whether the target address has been translated, which is expensive in terms of time. The prior art has attempted to minimize this problem by inlining a code sequence to search a smaller lookup table in the optimized traces, however, these mechanism still incur high overhead. Examples of indirect branches are return branches and switch branches.
A further problem with the prior art RTS is that it attempts to translate any code that is deemed hot based on a small threshold. This problem is referred to as complex and less reliable. There are some traces that are difficult to translate, but, without a translation, the execution of the trace would be performed by software simulation or emulation. Since emulation is slow, all hot code is translated. Some traces are very difficult to translate. For example, it is difficult to translate a trace with branches in the delay slot of another branch. The requirement of translating all hot code increases the translation time and complexity.
A further problem with the prior art RTS is that it will handle only user code and not operating system (OS) code. This is because the RTS is layered between the user application and the OS, and thus will not handle privileged instructions and addressing modes. In the prior art, the RTS is attached to user processes. Since the prior art RTS cannot be attached to the OS, it does not handle OS code.
Therefore, there is a need in the art for a RTS that does not require emulation for profiling, can handle indirect branches without returning control to a control loop, can refuse translation of difficult code and will handle OS code.
SUMMARY OF THE INVENTION
These and other objects, features and technical advantages are achieved by a system and method which embeds the control loop in hardware and, thus, does not require emulation for profiling, can handle indirect branches, will not translate difficult code, and will handle OS code. The inventive run-time optimization system (RTOS) places the control loop in the hardware and the translation/optimization components in the firmware, which are both below the OS level. Hence, the OS code can also be optimization candidates.
The inventive RTOS handles execution profiling and transfers execution to optimized traces automatically. This would allow code to run at faster native speed instead of slower emulation. Since the code is running faster, the threshold for selecting a hot trace could be set much higher than the prior art. This would also avoid generating traces for relatively infrequent code paths. Moreover, a higher threshold would enable the selection of better traces. Thus, a processor desires to execute a block of instructions, the processor first examines the Icache to determine whether the block is present. If not, the block is moved from memory to Icache. When the code is first moved into Icache, a threshold value is set into a counter associated with the particular instruction or instruction bundle (a group of instructions that can be issued together in the same cycle) of the Icache. Each time the instruction or instruction bundle is executed and retired, the counter is decremented by one. When the counter reaches zero, a trap is generated and the instruction (or instruction bundle) is designated as hot code.
After the trap is generated to firmware, a trace selector forms a trace of the hot code. The trace is followed to determine the location of the target, i.e the next instruction. The Icache maintains branch history information for the instructions in each cache line. This branch history is used to determine whether a branch should be predicted (as thus treated) as taken or to fall through. If the branch is
Benitez Manuel
Hsu Wei C.
El-Hady Nabil
Maung Zarni
LandOfFree
System and method using a hardware embedded run-time optimizer does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with System and method using a hardware embedded run-time optimizer, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and System and method using a hardware embedded run-time optimizer will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-2885134