Processor and data cache with data storage unit and tag...

Electrical computers and digital processing systems: memory – Storage accessing and control – Access timing

Reexamination Certificate

: 1996-11-13
: 2003-10-07
: Gossage, Glenn (Department: 2187)
: Electrical computers and digital processing systems: memory
: Storage accessing and control
: Access timing

: C711S118000, C713S500000, C713S501000, C713S600000
: Reexamination Certificate
: active
: 06631454
: ABSTRACT:

BACKGROUND OF THE INVENTION
1. Field of the Invention
The present invention relates generally to cache memories, and specifically to a data cache whose data storage and tag logic and hit/miss logic are split across multiple clock domains.
2. Background of the Prior Art
FIG. 1 illustrates a microprocessor 100 according to the prior art. The microprocessor includes an input/output (I/O) ring which operates at a first clock frequency, and an execution core which operates at a second clock frequency. For example, Intel Corporation's (Santa Clara, Calif.) 486DX2 (hereinafter referred to as DX2) may run its I/O ring at 33 megahertz (MHz) and its execution core at 66 MHz for a 2:1 ratio (1/2 bus), Intel Corporation's DX4 may run its I/O ring at 25 MHz and its execution core at 75 MHz for a 3:1 ratio (1/3 bus), and Intel Corporation's Pentium® OverDrive® processor may operate its I/O ring at 33 MHz and its execution core at 82.5 MHz for a 2.5:1 ratio (2/5 bus).
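The clock ratios quoted above are plain arithmetic and can be checked with a minimal sketch. The helper name `core_frequency_mhz` and the use of exact fractions are illustrative assumptions; nothing here is taken from the processors' own documentation.

```python
from fractions import Fraction

def core_frequency_mhz(io_mhz, ratio):
    """Execution-core frequency for a given I/O ring frequency.

    `ratio` is the core:I/O clock ratio (hypothetical helper, for illustration).
    Fractions keep non-integer ratios such as 5/2 exact.
    """
    return Fraction(io_mhz) * Fraction(ratio)

# The three examples from the text: (I/O ring MHz, core:I/O ratio)
examples = {
    "DX2": (33, Fraction(2, 1)),
    "DX4": (25, Fraction(3, 1)),
    "Pentium OverDrive": (33, Fraction(5, 2)),
}

for name, (io_mhz, ratio) in examples.items():
    core = core_frequency_mhz(io_mhz, ratio)
    print(f"{name}: I/O ring {io_mhz} MHz, core {float(core)} MHz")
```

Treating the ratio as a fraction rather than an integer mirrors the point that clocking need not be restricted to integer multiples.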
A distinction may be made between “I/O operations” and “execution operations”. For example, in the DX2, the I/O ring performs I/O operations such as buffering, bus driving, receiving, parity checking, and other operations associated with communicating with the off-chip world, while the execution core performs execution operations such as addition, multiplication, address generation, comparisons, rotation and shifting, and other “processing” manipulations.
The processor 100 may optionally include a clock multiplier. With the clock multiplier, the processor can automatically set the speed of its execution core according to an external, slower clock provided to its I/O ring. This may reduce the number of pins needed. Alternatively, the processor may include a clock divider, in which case the processor sets the I/O ring speed responsive to an external clock provided to the execution core.
These clock multiply and clock divide functions are logically the same for the purposes of this invention, so the term “clock mult/div” will be used herein to denote either a multiplier or divider as suitable. The skilled reader will comprehend how external clocks may be selected and provided, and from there multiplied or divided. Therefore, specific clock distribution networks, and the details of clock multiplication and division, will not be expressly illustrated. Furthermore, the clock mult/div units need not necessarily be limited to integer multiple clocks, but can perform e.g. 2:5 clocking. Finally, the clock mult/div units need not necessarily even be limited to fractional bus clocking, but can, in some embodiments, be flexible, asynchronous, and/or programmable, such as in providing a P/Q clocking scheme.
The basic motivation for increasing clock frequencies in this manner is to reduce instruction latency. The execution latency of an instruction may be defined as the time from when its input operands must be ready for it to execute until its result is ready to be used by another instruction. Suppose that a part of a program contains a sequence of N instructions, I_1, I_2, I_3, . . . , I_N. Suppose that I_{n+1} requires, as part of its inputs, the result of I_n, for all n from 1 to N−1. This part of the program may also contain any other instructions. Then we can see that this program cannot be executed in less time than T = L_1 + L_2 + L_3 + . . . + L_N, where L_n is the latency of instruction I_n, for all n from 1 to N. In fact, even if the processor was capable of executing a very large number of instructions in parallel, T remains a lower bound for the time to execute this part of this program. Hence to execute this program faster, it will ultimately be essential to shorten the latencies of the instructions.
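The lower bound T can be illustrated with a minimal sketch. It assumes an idealized machine with unlimited parallelism, where an instruction starts as soon as all of its inputs are ready. The function names, the dependency-list representation, and the latency values are hypothetical and chosen only for illustration.

```python
def chain_lower_bound(latencies):
    """T = L_1 + ... + L_N for a chain in which each instruction
    needs the result of the one before it."""
    return sum(latencies)

def finish_times(latencies, deps):
    """Earliest finish time of each instruction on an idealized machine
    with unlimited parallelism.

    deps[n] lists the indices of earlier instructions whose results
    instruction n requires.
    """
    finish = []
    for n, latency in enumerate(latencies):
        # An instruction cannot start before all of its inputs are ready.
        start = max((finish[d] for d in deps[n]), default=0)
        finish.append(start + latency)
    return finish

# A chain of four dependent instructions (latencies in arbitrary time units).
chain_latencies = [1, 3, 1, 2]
chain_deps = [[], [0], [1], [2]]
T = chain_lower_bound(chain_latencies)
assert max(finish_times(chain_latencies, chain_deps)) == T

# Adding an independent instruction does not shorten the chain:
# the total time is still bounded below by T.
more_latencies = chain_latencies + [2]
more_deps = chain_deps + [[]]
assert max(finish_times(more_latencies, more_deps)) >= T
```

Even with every independent instruction overlapped for free, the finish time of the chain equals the sum of its latencies, which is why shortening the latencies themselves is the only way to go faster.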
We may look at the same thing from a slightly different point of view. Define that an instruction I_n is “in flight” from the time that it requires its input operands to be ready until the time when its result is ready to be used by another instruction. Instruction I_n is therefore “in flight” for a length of time L_n = A_n * C, where A_n is the latency, as defined above, of I_n, but this time expressed in cycles. C is the cycle time. Let a program execute N instructions as above and take M “cycles” or units of time to do it. Looked at from either point of view, it is critically important to reduce the execution latency as much as possible.
The average latency can be conventionally defined as (1/N) * (L_1 + L_2 + L_3 + . . . + L_N) = (C/N) * (A_1 + A_2 + A_3 + . . . + A_N). Let f_j be the number of instructions that are in flight during cycle j. We can then define the parallelism P as the average number of instructions in flight for the program, or (1/M) * (f_1 + f_2 + f_3 + . . . + f_M).
Notice that f_1 + f_2 + f_3 + . . . + f_M = A_1 + A_2 + A_3 + . . . + A_N. Both sides of this equation are ways of counting up the number of cycles in which instructions are in flight, wherein if x instructions are in flight in a given cycle, that cycle counts as x cycles.
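The counting identity can be checked on a small example. The sketch below assumes a hypothetical schedule given as (start cycle, A_n) pairs; the schedule values and the function name `in_flight_counts` are made up for illustration.

```python
def in_flight_counts(schedule, total_cycles):
    """Return [f_0, ..., f_{M-1}], where f_j is the number of
    instructions in flight during cycle j (cycles numbered from 0)."""
    counts = [0] * total_cycles
    for start, cycles in schedule:
        # The instruction is in flight for `cycles` consecutive cycles.
        for j in range(start, start + cycles):
            counts[j] += 1
    return counts

# Hypothetical schedule: (start cycle, latency A_n in cycles) per instruction.
schedule = [(0, 2), (0, 3), (2, 1), (3, 2)]
M = 5  # total cycles taken by this program fragment

f = in_flight_counts(schedule, M)
total_in_flight_cycles = sum(f)                  # f_1 + ... + f_M
total_latency_cycles = sum(a for _, a in schedule)  # A_1 + ... + A_N

# Both sums count the same instruction-cycles, once by cycle and once
# by instruction.
assert total_in_flight_cycles == total_latency_cycles
```

Summing by cycle and summing by instruction are two traversals of the same set of (instruction, cycle) pairs, so the totals must agree for any schedule.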
Now define the “average bandwidth” B as the total number of instructions executed, N, divided by the time used, M*C, or in other words, B=N/(M*C).
We may then easily see that P = L*B. In this formula, L is the average latency for a program, B is its average bandwidth, and P is its average parallelism. Note that B tells how fast we execute the program; it is measured in instructions per second. If the program has N instructions, it takes N/B seconds to execute it. The goal of a faster processor is exactly the goal of getting B higher.
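The relation follows directly from the definitions of average latency, bandwidth, and parallelism given above, together with the counting identity between the f_j and the A_n. Written out as a short derivation, using only symbols already defined in the text:

```latex
\[
\begin{aligned}
L \cdot B
  &= \left( \frac{C}{N} \sum_{n=1}^{N} A_n \right) \cdot \frac{N}{M\,C} \\
  &= \frac{1}{M} \sum_{n=1}^{N} A_n \\
  &= \frac{1}{M} \sum_{j=1}^{M} f_j \;=\; P
\end{aligned}
\]
```

The cycle time C and the instruction count N cancel, so the identity holds regardless of clock frequency or program length.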
We now note that increasing B requires either increasing the parallelism P, or decreasing the average latency L. It is well known that the parallelism P that can be readily exploited for a program is limited. While it is true that certain classes of programs have large exploitable parallelism, a large class of important programs has P restricted to quite small numbers.
One drawback which the prior art processors have is that their entire execution core is constrained to run at the same clock speed. This limits some components within the core in a “weakest link” or “slowest path” manner.
In the 1960s and 1970s, central processing units were developed in which a multiplier or divider co-processor was clocked at a frequency higher than other circuitry in the central processing unit. These central processing units were constructed of discrete components rather than as integrated circuits or monolithic microprocessors. Due to their construction as co-processors, and/or the fact that they were not integrated with the main processor, these units should not be considered as “sub-cores”.
Another feature of some prior art processors is the ability to perform “speculative execution”. This is also known as “control speculation”, because the processor guesses which way control (branching) instructions will go. Some processors perform speculative fetch, and others, such as the Intel Corporation's (Santa Clara, Calif.) Pentium Pro processor, also perform speculative execution. Control speculating processors include mechanisms for recovering from mispredicted branches, to maintain program and data integrity as though no speculation were taking place.
FIG. 2 illustrates a conventional data hierarchy. A mass storage device, such as a hard drive, stores the programs and data (collectively “data”) which the computer system (not shown) has at its disposal. A subset of that data is loaded into memory such as dynamic random access memory (DRAM) for faster access. A subset of the DRAM contents may be held in a cache memory. The cache memory may itself be hierarchical, and may include a level two (L2) cache, and then a level one (L1) cache which holds a subset of the data from the L2. Finally, the physical registers of the processor contain a smallest subset of the data. As is well known, various algorithms may be used to determine what data is stored in

Inventor: Sager David J.
Law Firm: Blakely, Sokoloff, Taylor & Zafman LLP
Examiner: Gossage Glenn
Corporate Assignee: Intel Corporation
