Predecode in parallel with TLB compare

Electrical computers and digital processing systems: memory – Storage accessing and control – Hierarchical memories


Details

Type: Reexamination Certificate
Status: active
Patent number: 06591343
Classifications: C711S140000, C711S205000, C711S128000, C711S206000, C711S207000, C712S213000

ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of the Invention
This invention relates in general to the field of instruction execution in computers, and more particularly to an apparatus and method for predecoding macro instructions prior to translation.
2. Description of the Related Art
The architecture of a present day pipeline microprocessor consists of a path, or channel, known as the pipeline, that is divided into stages. Each of the pipeline stages performs specific tasks related to the accomplishment of an overall operation that is directed by a programmed instruction. Software application programs are composed of sequences of macro instructions. As a macro instruction enters the first stage of the pipeline, certain tasks are accomplished. The macro instruction is then passed to subsequent stages for the execution of subsequent tasks. Following completion of a final task, the instruction completes execution and exits the pipeline. Execution of programmed instructions by a pipeline microprocessor is thus very much like the manufacture of items on an assembly line.
The efficiency of any assembly line is beneficially impacted by the following two factors: 1) keeping each stage of the assembly line occupied with work and 2) ensuring that the tasks performed within each stage are equally balanced, that is, optimizing the line so that no one stage creates a bottleneck. These same factors can also be said to affect the efficiency of a pipeline microprocessor. Consequently, it is incumbent upon microprocessor designers 1) to provide logic within each of the stages that maximizes the probability that none of the stages in the pipeline will sit idle and 2) to distribute the tasks among the architected pipeline stages such that no one stage will be the source of a bottleneck in the pipeline. Bottlenecks, or pipeline stalls, cause delays in the execution of application programs.
The first stage of a pipeline microprocessor, the fetch stage, performs the task of retrieving macro instructions from memory devices external to the microprocessor. External memory in a desktop computer system typically takes the form of random access memory (RAM), a device technology that is significantly slower than logic within the microprocessor itself. Hence, accessing external memory each time an instruction is required for execution would create an overwhelming bottleneck in the fetch stage. For this reason, a present day microprocessor transfers large blocks of instructions from external memory into a smaller, yet significantly faster, memory device that resides within the microprocessor chip itself. This internal memory device is referred to as an instruction cache. The large blocks of instructions, known as cache lines, are transferred in parallel bursts rather than one byte at a time, thus alleviating some of the delays associated with retrieving instructions from external memory. Ideally, when an instruction is required for execution, it has already been transferred into the instruction cache so that it may immediately be forwarded to the next stage in the pipeline. Finding a requested instruction within the instruction cache is referred to as a cache hit. A cache miss occurs when the requested instruction is not found within the cache and the pipeline must be stalled while the requested instruction is retrieved from external memory. Virtually all present day microprocessors have an on-board instruction cache, the average cache size being approximately 64 KB.
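To make the hit/miss mechanics above concrete, here is a minimal C sketch of a direct-mapped instruction cache lookup. The 32-byte line size, the 2048-line (64 KB) capacity, and all names (icache_line_t, icache_fetch) are assumptions chosen for illustration; the text does not prescribe any particular cache organization.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define LINE_BYTES 32
#define NUM_LINES  2048                    /* 2048 lines x 32 B = 64 KB */

typedef struct {
    int      valid;                        /* does the line hold live data?  */
    uint32_t tag;                          /* upper address bits of the line */
    uint8_t  bytes[LINE_BYTES];            /* the cached instruction bytes   */
} icache_line_t;

static icache_line_t icache[NUM_LINES];

/* Returns 1 on a cache hit (requested byte copied to *out); returns 0 on a
   miss, where a real fetch stage would stall the pipeline and burst the
   whole cache line in from external memory. */
static int icache_fetch(uint32_t addr, uint8_t *out)
{
    uint32_t index = (addr / LINE_BYTES) % NUM_LINES;
    uint32_t tag   = addr / (LINE_BYTES * NUM_LINES);
    icache_line_t *line = &icache[index];

    if (line->valid && line->tag == tag) {      /* cache hit */
        *out = line->bytes[addr % LINE_BYTES];
        return 1;
    }
    return 0;                                   /* cache miss */
}

int main(void)
{
    uint32_t addr = 0x1000;
    uint8_t  b;

    if (!icache_fetch(addr, &b)) {              /* first access misses */
        icache_line_t *line = &icache[(addr / LINE_BYTES) % NUM_LINES];
        line->valid = 1;
        line->tag   = addr / (LINE_BYTES * NUM_LINES);
        memset(line->bytes, 0x90, LINE_BYTES);  /* simulate the burst fill */
    }
    printf("hit on retry: %d\n", icache_fetch(addr, &b));
    return 0;
}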
The next stage of a present day pipeline, the translate (or decode) stage, deals with converting a macro instruction into a sequence of associated micro instructions for execution by subsequent stages of the microprocessor. Macro instructions specify high-level operations such as arithmetic operations, Boolean logic operations, and data load/store operations: operations that are too complex to be performed within any one given stage of a pipeline. Because of this, macro instructions are decoded and functionally decomposed by logic in the translate stage into a sequence of micro instructions whose sub-tasks can be efficiently executed within each of the pipeline stages, thus precluding bottlenecks in subsequent stages of the pipeline. Decoded micro instructions are then issued sequentially to the subsequent stages for execution.
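As a concrete, purely hypothetical illustration of this decomposition, the C sketch below shows how an x86 read-modify-write ADD to memory might break into load, ALU, and store micro instructions, one per pipeline-friendly sub-task. The uop_t format and its field names are invented for this example; actual micro instruction encodings are proprietary and machine-specific.

#include <stdio.h>

typedef enum { UOP_LOAD, UOP_ADD, UOP_STORE } uop_kind_t;

typedef struct {
    uop_kind_t  kind;
    const char *dst, *src;       /* symbolic operands, for readability */
} uop_t;

/* ADD [EBX], EAX decomposed into three single-stage micro instructions */
static const uop_t add_mem_reg[] = {
    { UOP_LOAD,  "tmp",   "[EBX]" },   /* 1: read the memory operand   */
    { UOP_ADD,   "tmp",   "EAX"   },   /* 2: perform the ALU operation */
    { UOP_STORE, "[EBX]", "tmp"   },   /* 3: write the result back     */
};

int main(void)
{
    static const char *names[] = { "LOAD", "ADD", "STORE" };
    for (int i = 0; i < 3; i++)
        printf("uop%d: %-5s %s, %s\n", i, names[add_mem_reg[i].kind],
               add_mem_reg[i].dst, add_mem_reg[i].src);
    return 0;
}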
The format and composition of micro instructions for a particular microprocessor are unique to that design and are hence tailored to execute very efficiently on that microprocessor. In spite of this, the translation of macro instructions into micro instructions without causing undue pipeline delays persists as a significant challenge to microprocessor designers. More specifically, translation of x86 macro instructions is particularly difficult and time consuming, primarily because x86 instructions can vary from 1 to 15 bytes in length and their opcode bytes (i.e., the bytes that provide the essential information about the format of a particular instruction) can follow up to four optional prefix bytes. One skilled in the art will agree that marking boundaries between macro instructions and designating the bytes containing opcodes is a task that is common to the translation of all macro instructions. This task of determining initial information about macro instructions is referred to as predecoding.
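Here is a minimal C sketch of the predecode step just described: scan the raw instruction bytes, skip up to four optional prefix bytes, and mark where the opcode begins. The prefix list below is abbreviated, full x86 length decoding (ModRM, SIB, displacement, immediate) is deliberately omitted, and the function names are illustrative.

#include <stdint.h>
#include <stdio.h>

/* A few common x86 prefix bytes: operand/address size, LOCK, REP/REPNE,
   and the segment overrides.  (Not an exhaustive list.) */
static int is_prefix(uint8_t b)
{
    switch (b) {
    case 0x66: case 0x67: case 0xF0: case 0xF2: case 0xF3:
    case 0x2E: case 0x36: case 0x3E: case 0x26: case 0x64: case 0x65:
        return 1;
    default:
        return 0;
    }
}

/* Returns the byte offset of the opcode within the instruction (0..4),
   i.e., the number of prefix bytes that precede it. */
static int predecode_opcode_offset(const uint8_t *bytes, int len)
{
    int off = 0;
    while (off < 4 && off < len && is_prefix(bytes[off]))
        off++;                   /* at most four optional prefixes */
    return off;
}

int main(void)
{
    /* F3 0F A5: a REP prefix followed by a two-byte 0F-escape opcode */
    const uint8_t insn[] = { 0xF3, 0x0F, 0xA5 };
    printf("opcode starts at byte %d\n",
           predecode_opcode_offset(insn, (int)sizeof insn));
    return 0;
}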
As macro instruction sets continue to grow, exemplified by the addition of MMX® instructions to the x86 instruction set in the late 1990s, the operations (and attendant clock cycles) required to decode these instructions have drawn attention back to overcoming bottlenecks in the translate stage. Consequently, to more evenly balance the operations performed within stages of the pipeline, more recent microprocessor designs have shifted the predecoding operation up into the fetch stage.
There are two techniques used today to predecode macro instructions in the fetch stage. The first technique, employed within the Intel Pentium® II/III series of microprocessors, performs the predecoding operation following retrieval of the bytes of a macro instruction from the instruction cache. Accordingly, predecode logic generates a predecode field corresponding to each byte of the macro instruction and provides these fields along with the bytes in a macro instruction queue. Translation logic then retrieves the instruction bytes and predecode fields from the queue as required. Under some conditions, the time required to perform predecoding in this manner is actually transparent to the pipeline because the translation logic is still able to access bytes from the queue for translation while subsequent bytes are being predecoded. But when the queue is empty, the pipeline must be stalled until predecoding completes.
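The sketch below models this first technique with a simple queue of byte-plus-predecode-field pairs sitting between the predecoder and the translator. The queue depth, the one-byte predecode field, and all names are assumptions made for illustration, not details of the Pentium® II/III design; the empty-queue case corresponds to the stall described above.

#include <stdint.h>
#include <stdio.h>

#define QDEPTH 16

typedef struct {
    uint8_t byte;   /* raw instruction byte                           */
    uint8_t pd;     /* predecode field; here bit 0 is assumed to mean */
                    /* "this byte starts a new macro instruction"     */
} qentry_t;

static qentry_t q[QDEPTH];
static int head, tail, count;

/* Predecoder side: pair each fetched byte with its predecode field. */
static int q_push(uint8_t byte, uint8_t pd)
{
    if (count == QDEPTH) return 0;         /* queue full */
    q[tail] = (qentry_t){ byte, pd };
    tail = (tail + 1) % QDEPTH;
    count++;
    return 1;
}

/* Translator side: a failed pop is the stall case -- the translator
   must wait until predecoding catches up. */
static int q_pop(qentry_t *out)
{
    if (count == 0) return 0;              /* queue empty: stall */
    *out = q[head];
    head = (head + 1) % QDEPTH;
    count--;
    return 1;
}

int main(void)
{
    qentry_t e;
    q_push(0x90, 0x01);                    /* a NOP, marked as a boundary */
    if (q_pop(&e))
        printf("byte %02X, predecode %02X\n", e.byte, e.pd);
    return 0;
}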
A second technique for predecoding is believed to be employed within Advanced Micro Devices' K6® series of microprocessors. This second technique performs predecoding prior to inserting bytes of a macro instruction into the instruction cache. Accordingly, the time required to predecode instruction bytes is absorbed into the time required to retrieve cache lines from external memory. Predecode information fields corresponding to each instruction byte fetched from memory must then be stored alongside each instruction byte in the instruction cache. Hence, although this second predecoding technique may alleviate potential bottlenecks in the fetch stage, it requires a significantly larger instruction cache than would otherwise be needed.
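To make the storage penalty concrete, the short C program below computes the extra bits a 64 KB instruction cache would need if every cached byte carried its own predecode field. The 3-bit field width is an assumed figure chosen for illustration; the text says only that the cache must be significantly larger.

#include <stdio.h>

#define CACHE_BYTES (64 * 1024)     /* 64 KB of instruction bytes      */
#define PD_BITS     3               /* assumed predecode bits per byte */

int main(void)
{
    long data_bits = CACHE_BYTES * 8L;
    long pd_bits   = CACHE_BYTES * (long)PD_BITS;

    printf("instruction storage: %ld bits\n", data_bits);
    printf("predecode storage:   %ld bits (+%.1f%%)\n",
           pd_bits, 100.0 * pd_bits / data_bits);
    return 0;
}

Under these assumed numbers the cache array grows by more than a third, which illustrates the "significantly larger" cost the text refers to.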
Neither of the two techniques above sufficiently addresses the predecoding problem. The first approach can still stall the pipeline because predecoding is not performed in parallel with some other function in the fetch stage. The second approach requires a significantly larger cache, which results in more complex and costly parts.
Therefore, what is needed is a predecoding apparatus in a pipeline microprocessor that performs predecoding in parallel with another operation in the fetch stage.
In addition, what is needed is a predecoding apparatus that does not require predecode information to be stored alongside each instruction byte in the instruction cache.
