Profile driven code motion and scheduling

Data processing: software development – installation – and managem – Software program development tool – Translation of code

Reexamination Certificate

Rate now

  [ 0.00 ] – not rated yet Voters 0   Comments 0

Details

C707S793000, C709S241000, C709S241000, C712S233000, C712S240000

Reexamination Certificate

active

06594824

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates generally to software compiler systems used by processors and computing devices. More particularly, the invention relates to a method and apparatus for optimizing object-code sequencing by such a compiler.
2. Description of the Related Art
Processors can be made faster either by using faster circuitry or by increasing the amount of simultaneous computation they perform. The degree of simultaneous computation may be increased by including instruction-level parallelism (ILP), which exploits the inherent parallelism that most programs naturally contain. The basis of ILP is that the order in which operations are performed by the processor may be rearranged so that there is overlap in the execution of operations. A simple example is where the operation of locating a memory address is performed while the value to be stored at that address is computed. Significant improvements in efficiency can be achieved by appropriate rearrangement of operations.
Some ILP processors, such as Superscalar processors, reorder operations that are known to depend on each other because of the hardware. Other processors, such as Very Long Instruction Word (VLIW) processors or Explicitly Parallel Instruction Computing (EPIC) processors, rely on the instructions themselves to express parallelism. An advantage of ILP exploitation is that individual users are not required to rewrite programs to make use of the parallelism; instead, compilation techniques are invoked automatically to map a program's ILP to the parallelism available for the target architecture.
The approaches that have traditionally been taken for reorganizing the instruction set, referred to generally as “instruction scheduling,” may be categorized as either local or global scheduling, such terminology depending on the organization of the instruction sequence into “basic blocks.” A basic block is a straight-line sequence of intermediate instructions having a single entry and a single exit, with no internal branches. Thus, each operation specified in a basic block will be executed once each time execution enters the block. Local scheduling refers to reorganization of instructions only within individual basic blocks while global scheduling encompasses the rearrangement of instructions between multiple basic blocks. While local scheduling may have been adequate for early processors, it is generally insufficient for modern architectures because it is inherently unable to exploit parallelism between basic blocks. This is particularly relevant for an EPIC architecture, which requires global scheduling to achieve adequate improvements.
There are several approaches to global scheduling that extract parallelism from a program by moving operations across basic block boundaries and inserting compensation copies to maintain program semantics. One is the use of a trace scheduling algorithm, which attempts to optimize the most frequently used program paths at the expense of less frequently used paths. This technique reduces the problem to a local-scheduling problem and is described, for example, in P. Geoffrey Lowney et al., “The Multiflow Trace Scheduling Compiler,”
J. Supercomputing
7, 51 (1993) and Joseph A. Fisher, “Trace Scheduling: A Technique for Global Microcode Compaction,”
IEEE Trans. Comps
. C-30, 478 (1981). The technique functions by identifying a frequently executed acyclic path, i.e. a “trace,” in the flow of operations and then allowing operations to move past branches or labels within that path. Corrections to the code are then inserted to compensate for such movement where basic blocks branch into the middle of traces or are branched into from traces. In particular, operations that are moved from one basic block on the trace to another basic block on the trace need to be copied to certain basic blocks that are not on the trace. While the technique reduces the overall number of instructions in the trace, this may be at the expense of increasing the number of instructions in program segments that are not part of the trace. Since those program segments that are not part of the trace are executed less frequently, the scheduling results in an overall improvement in program execution time.
Another approach to global scheduling is percolation scheduling as described, for example, in A. Nicolau, “Percolation Scheduling: A Parallel Computation Technique,” Tech. Rep, Cornell Univ. (1984). This technique is based on a set of core transformations applicable to a program graph. It is a “greedy” algorithm that increases parallelism by moving operations upwards in the graph as much as possible, i.e. as close as possible to the graph node that represents the particular function being compiled. Because the technique ignores system requirements, operations that are executed with small probability nevertheless consume significant system resources. An attempt to address this deficiency is included in enhanced percolation scheduling, where operation movement is delayed until scheduling time. Enhanced percolation scheduling is described in, for example, K. Ebcioglu and A. Nicolau, “A Global Resource-Constrained Parallelization Technique,”
Proc.
3
d Int'l Conf. Supercomp.,
154 (1989).
Like trace scheduling, percolation scheduling requires that compensatory code corrections be introduced. Other related techniques that also require compensatory code corrections include sentinel scheduling and region scheduling. Sentinel scheduling uses profiling information to schedule discrete multiple basic blocks (“superblocks”) and is described, for example, in S. A. Mahlke et al., “Sentinel scheduling for VLIW and superscalar processors,”
asplos
5 27, 238 (1992). Region scheduling, in which program transformation rules are used in an attempt to balance the program graph, is described, for example, in R. Gupta and M. Soffa, “Region scheduling: An approach for detecting and redistributing parallelism,”
IEEE Trans. Software Eng.
16, 421 (1990).
Global instruction scheduling differs from these in that it attempts to limit code duplication. While operations may be moved beyond basic block boundaries, the technique constrains the movement of operations to be within an enclosing loop. The technique is described, for example, in D. Bernstein and M. Rodeh, “Global instruction scheduling for superscalar machines,”
Conf. Prog. Lang. Design and Implementation
, SIGPLAN ′91, p. 241 (1991).
A further approach to global scheduling is described in U.S. Pat. No. 5,557,761, issued to Chan et al. The technique presented there operates by first selecting source and target basic blocks, which are identified according to the relationships of the instructions within them. An attempt is made to identify a maximal set of instructions contained in the source basic block that may by moved to the target basic block without violating any data dependency relationships and remaining within the confines of available resources. A model (called the “Latency Subsumption Cost of Movement Model”) is used to specify the overall cost of such movement, which is then used to decide whether to move the identified maximal instruction set.
SUMMARY OF THE INVENTION
The present invention is directed to a method and apparatus for generating an optimized intermediate representation of source code for a computer program. The optimized intermediate representation is generated from an initial intermediate representation, which is extracted from the source code by organizing it as a plurality of basic blocks. Each of the basic blocks contains at least one program instruction. The basic blocks in the initial intermediate representation are ordered according to an estimated profit value that is assigned to each of the basic blocks. A goal function, which is designed as an appropriate measure of the degree of optimization of the program according to its intermediate representation, is then calculated. The effect on the goal function of modifying the intermediate representation by moving an instruction from one of the ba

LandOfFree

Say what you really think

Search LandOfFree.com for the USA inventors and patents. Rate them and share your experience with other people.

Rating

Profile driven code motion and scheduling does not yet have a rating. At this time, there are no reviews or comments for this patent.

If you have personal experience with Profile driven code motion and scheduling, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and Profile driven code motion and scheduling will most certainly appreciate the feedback.

Rate now

     

Profile ID: LFUS-PAI-O-3063874

  Search
All data on this website is collected from public sources. Our data reflects the most accurate information available at the time of publication.