Electrical computers and digital processing systems: memory – Storage accessing and control – Control technique
Reexamination Certificate
1996-08-26
2001-10-02
Ellis, Kevin L. (Department: 2185)
Electrical computers and digital processing systems: memory
Storage accessing and control
Control technique
C711S131000, C711S140000
Reexamination Certificate
active
06298423
ABSTRACT:
BACKGROUND OF THE INVENTION
The present invention relates to microprocessors, and, more particularly, to providing microprocessors with high performance data caches and load/store functional units.
Microprocessors are processors which are implemented on one or a very small number of semiconductor chips. Semiconductor chip technology is ever increasing the circuit densities and speeds within microprocessors; however, the interconnection between the microprocessor and external memory is constrained by packaging technology. Though on-chip interconnections are extremely cheap, off-chip connections are very expensive. Any technique intended to improve microprocessor performance must take advantage of increasing circuit densities and speeds while remaining within the constraints of packaging technology and the physical separation between the processor and its external memory. While increasing circuit densities provide a path to evermore complex designs, the operation of the microprocessor must remain simple and clear for users to understand how to use the microprocessor.
While the majority of existing microprocessors are targeted toward scalar computation, superscalar microprocessors are the next logical step in the evolution of microprocessors. The term superscalar describes a computer implementation that improves performance by a concurrent execution of scalar instructions. Scalar instructions are the type of instructions typically found in general purpose microprocessors. Using today's semiconductor processing technology, a single processor chip can incorporate high performance techniques that were once applicable only to large-scale scientific processors. However, many of the techniques applied to large scale processors are either inappropriate for scalar computation or too expensive to be applied to microprocessors.
A microprocessor runs application programs. An application program comprises a group of instructions. In running the application program, the processor fetches and executes the instructions in some sequence. There are several steps involved in the executing even a single instruction, including fetching the instruction, decoding it, assembling its operands, performing the operations specified by the instruction, and writing the results of the instruction to storage. The execution of instructions is controlled by a periodic clock signal. The period of the clock signal is the processor cycle time.
The time taken by a processor to complete a program is determined by three factors: the number of instructions required to execute the program; the average number of processor cycles required to execute an instruction; and, the processor cycle time. Processor performance is improved by reducing the time taken, which dictates reducing one or more of these factors.
One way to improve performance of the microprocessor is by overlapping the steps of different instructions, using a technique called pipelining. To pipeline instructions, the various steps of instruction execution are performed by independent units called pipeline stages. Pipeline stages are separated by clocked registers. The steps of different instructions are executed independently in different pipeline stages. Pipelining reduces the average number of cycles required to execute an instruction, though not the total amount of time required to execute an instruction, by permitting the processor to handle more than one instruction at a time. This is done without increasing the processor cycle time appreciably. Pipelining typically reduces the average number of cycles per instruction by as much as a factor of three. However, when executing a branch instruction, the pipeline may sometimes stall until the result of the branch operation is known and the correct instruction is fetched for execution. This delay is known as the branch-delay penalty. Increasing the number of pipeline stages also typically increases the branch-delay penalty relative to the average number of cycles per instruction.
Another way to improve processor performance is to increase the speed with which the microprocessor assembles the operands of an instruction and writes the results of the instruction; these functions are referred to as a load and a store, respectively. Both of these functions depend upon the microprocessor's use of its data cache.
During the development of early microprocessors, instructions took a long time to fetch compared to the execution time. This motivated the development of complex instruction set computer (CISC) processors. CISC processors were based on the observation that given the available technology, the number of cycles per instruction was determined mostly by the number of cycles taken to fetch the instruction. To improve performance, the two principal goals of the CISC architecture were to reduce the number on instructions needed for a given task and to encode these instructions densely. It was acceptable to accomplish these goals by increasing the average number of cycles taken to decode and execute an instruction because using pipelining, the decode and execution cycles could be mostly overlapped with a relatively lengthy instruction fetch. With this set of assumptions, CISC processors evolved densely encoded instructions at the expense of decode and execution time inside the processor. Multiple-cycle instructions reduced the overall number of instructions and thus reduced the overall execution time because they reduced the instruction fetch time.
In the late 1970's and early 1980's, memory and packaging technology changed rapidly. Memory densities and speed increased to the point where high speed local memories called caches could be implemented near the processor. Caches are used by the processor to temporarily store instructions and data. When instructions are fetched more quickly using caches, the performance is limited by the decode and execution time that was previously hidden within the instruction fetch time. The number of instructions does not affect performance as much as the average number of cycles taken to execute an instruction.
The improvement in memory and packaging technology, to the point where instruction fetching did not take much longer than instruction execution, motivated the development of reduced instruction set computer (RISC) processors. To improve performance, the principal goal of a RISC architecture is to reduce the number of cycles taken to execute an instruction, allowing some increase in the total number of instructions. The trade-off between cycles per instruction and the number of instructions is not one to one. Compared to CISC processors, RISC processors typically reduce the number of cycles per instruction by factors of three to five, while they typically increase the number of instructions by thirty to fifty percent. RISC processors rely on auxiliary features such as a large number of general purpose registers, and instruction and data caches to help the compiler reduce the overall instruction count or to help reduce the number of cycles per instruction.
A typical RISC processor executes one instruction on every processor cycle. A superscalar processor reduces the average number of cycles per instruction beyond what is possible in a pipelined scalar RISC processor by allowing concurrent execution of instructions in the same pipeline stage as well as concurrent execution of instructions in different pipeline stages. The term superscalar emphasizes multiple concurrent operations on scalar quantities as distinguished from multiple concurrent operations on vectors or arrays as is common in scientific computing.
While superscalar processors are conceptually simple, there is more to achieving increased performance than widening a processor's pipeline. Widening the pipeline makes it possible to execute more than one instruction per cycle but there is no guarantee that any given sequence of instructions can take advantage of this capability. Instructions are not independent of one another but are interrelated; these interrelationships prevent some instruction
Chinnakonda Murali
Johnson William M.
Witt David B.
Advanced Micro Devices , Inc.
Ellis Kevin L.
Skjerven Morrill & MacPherson LLP
Terrile Stephen A.
LandOfFree
High performance load/store functional unit and data cache does not yet have a rating. At this time, there are no reviews or comments for this patent.
If you have personal experience with High performance load/store functional unit and data cache, we encourage you to share that experience with our LandOfFree.com community. Your opinion is very important and High performance load/store functional unit and data cache will most certainly appreciate the feedback.
Profile ID: LFUS-PAI-O-2575486