Method and apparatus for improving caching within a...

Electrical computers and digital processing systems: memory – Storage accessing and control – Hierarchical memories

Reexamination Certificate

Details

C711S123000, C711S128000

active

06449693

ABSTRACT:

FIELD OF THE INVENTION
The present invention relates to processor systems and more specifically to a method and apparatus for improving caching within a processor system.
BACKGROUND OF THE INVENTION
Typical processor designs include an on-chip, “level-1” cache (“L1 cache”) for fast access to the contents (e.g., data or instructions, hereinafter “information”) of the most recently used memory locations. Many processors can access and use L1 cache contents in a single central processing unit (CPU) cycle (hereinafter “cycle”) rather than in the two or more cycles required for accessing an off-chip, “level-2” cache (“L2 cache”). Access to the contents of system memory requires even more cycles.
Recent advances in semiconductor manufacturing technologies and processor design techniques have produced highly complex CPU microarchitectures coupled with large L1 caches that improve many aspects of CPU performance (e.g., processor speed). However, increased L1 cache size has rendered single-cycle L1 cache access difficult. For example, as a cache's size is increased, additional address bits are required to directly access the information stored within the cache, and a larger decoder is required to decode the additional address bits. A larger decoder is inherently slower than a smaller decoder due to additional gate delays in the decode path of the larger decoder, and due to additional loading of each address line that drives an input of the larger decoder. Thus, a larger L1 cache has a longer decode time than a smaller L1 cache.
One technique for reducing the increased decode delay of a larger L1 cache is to increase the cache's associativity (e.g., the number of lines per cache row). For example, a 64 kilobyte (“K”), eight-way set associative cache with 32-byte lines stores eight 32-byte lines per cache row (e.g., in eight different “array cells”) for a total of 256 bytes per cache row, and 256 cache rows per cache. Therefore, only an 8-bit address decoder (e.g., 2^8 = 256) is required to access the 256 cache rows instead of an 11-bit address decoder (e.g., 2^11 = 2048) if only one 32-byte line per cache row were employed (e.g., a “single-set” associative cache). Decode delay thereby is reduced.
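The decoder-width arithmetic in this example can be checked with a short sketch. The function name and its parameters are illustrative assumptions and are not part of the described apparatus; the row count is assumed to be a power of two.

```python
def decoder_bits(cache_bytes: int, line_bytes: int, ways: int) -> int:
    """Return the number of address bits needed to select one cache row."""
    # Each row holds `ways` lines, so the row count shrinks as associativity grows.
    rows = cache_bytes // (line_bytes * ways)
    if rows <= 0 or rows & (rows - 1):
        raise ValueError("row count must be a positive power of two")
    # For a power of two, log2(rows) equals bit_length() - 1.
    return rows.bit_length() - 1


# 64K cache, 32-byte lines: eight lines per row versus one line per row.
eight_way = decoder_bits(64 * 1024, 32, ways=8)
one_line_per_row = decoder_bits(64 * 1024, 32, ways=1)
print(eight_way, one_line_per_row)  # prints: 8 11
```

The three-bit difference corresponds to the factor of eight by which the row count falls when eight lines share a row.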
While increasing cache associativity decreases decoder size, each decoder output must drive additional array cells (e.g., eight array cells per cache row for an 8-way set associative cache). Buffering may mitigate loading effects, but buffer circuitry itself creates additional delays. Further, once a cache row is identified via a decode operation, the cache must determine whether the identified cache row actually contains the desired information within one of the cache row's array cells, and if so, in which array cell the information resides (e.g., via tag compare and select operations). These determinations may cause additional cache access delays.
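The decode, tag compare and select steps can be illustrated with a minimal software sketch. It assumes the 64K, eight-way, 32-byte-line geometry of the earlier example; the names `split_address` and `lookup`, the address field layout and the list-based cache model are hypothetical and chosen only for illustration. Hardware would compare all tags of a row in parallel rather than in a loop.

```python
OFFSET_BITS = 5   # assumed: 32-byte lines
INDEX_BITS = 8    # assumed: 256 cache rows
WAYS = 8          # assumed: eight array cells per row


def split_address(addr: int):
    """Split an address into (tag, row index, byte offset) fields."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


def lookup(cache, addr: int):
    """Return (way, byte) on a hit, or None on a miss.

    `cache` is a list of rows; each row is a list of WAYS entries,
    and each entry is either None or a (tag, line_bytes) pair.
    """
    tag, index, offset = split_address(addr)
    row = cache[index]                      # decode: pick one row
    for way, entry in enumerate(row):       # tag compare across the row
        if entry is not None and entry[0] == tag:
            return way, entry[1][offset]    # select: the matching array cell
    return None


# Build an empty cache and place one line in way 3 of the row for 0x12345.
cache = [[None] * WAYS for _ in range(1 << INDEX_BITS)]
tag, index, offset = split_address(0x12345)
cache[index][3] = (tag, bytes(range(32)))
hit = lookup(cache, 0x12345)
miss = lookup(cache, 0x52345)  # same row index, different tag
```

Each step in `lookup` corresponds to a source of delay named above: the row decode, the per-cell tag comparison, and the final selection of one array cell.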
In addition to decode delays, tag compare delays and select delays, the increased physical dimensions of a large L1 cache contribute to cache access delay by increasing the cache's internal wiring lengths (e.g., increasing signal propagation times). High-performance CPUs that have large L1 caches typically employ additional, and often more complex, requesters such as execution units, instruction fetch units and the like. The increased size and number of requesters that must interface with a large L1 cache make placement of the requesters near cache input and output ports difficult, increase external wiring lengths and thus further increase cache access time. Cache arbitration among multiple requesters accessing the larger L1 cache also increases cache access time.
The delays associated with larger decoders, tag compare and select operations, increased wiring lengths and cache arbitration, as well as other delays, combine to make cache access the timing bottleneck for most processor designs employing large L1 caches. Accordingly, a need exists for a method and apparatus for improving caching within a processor system by reducing the pressure on cache access time.
SUMMARY OF THE INVENTION
To address the needs of the prior art, an inventive processor system is provided. The inventive processor system comprises a plurality of level-0 (L0) caches, a processor having a plurality of execution units, and an L1 cache for caching any data and instructions used by the processor. The L1 cache and the L0 caches preferably are internal to the processor, although external caches may be employed. A portion of the execution units provided are configured so that each execution unit within the portion accesses one of the L0 caches. Each of the L0 caches is accessible by only one of the portion of the execution units, and each L0 cache caches a subset of any data used by the processor which is not cacheable by any of the other L0 caches.
The processor system preferably comprises an instruction dispatcher that dispatches instructions executable by the processor and that selectively designates data as cacheable by only one of the L0 caches. The designation of data as cacheable by only one of the L0 caches preferably occurs at the time instructions are dispatched by the instruction dispatcher (i.e., at dispatch time). For example, an instruction dispatch circuit may be provided that designates data as cacheable by only one of the L0 caches based on a portion of a linear address for the data.
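One way to picture a designation based on a portion of the linear address is the sketch below. It is only an assumption-laden illustration: the source does not say which address bits are used or how many L0 caches exist, so the line size, the cache count, the modulo mapping and the name `select_l0_cache` are all hypothetical.

```python
LINE_BYTES = 32       # assumed line size, borrowed from the background example
NUM_L0_CACHES = 4     # assumed count; the source does not give one


def select_l0_cache(linear_address: int) -> int:
    """Return the index of the single L0 cache allowed to cache this address.

    Hypothetical mapping: the line number (the address bits above the
    byte offset) taken modulo the number of L0 caches.
    """
    line_number = linear_address // LINE_BYTES
    return line_number % NUM_L0_CACHES


# Every byte of a given line maps to the same L0 cache, and consecutive
# lines rotate through the caches, so no address is cacheable in two of them.
owners = [select_l0_cache(addr) for addr in (0, 31, 32, 64, 96, 128)]
print(owners)  # prints: [0, 0, 1, 2, 3, 0]
```

Because the mapping is a pure function of the address, it can be evaluated at dispatch time without consulting the current thread or task.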
A significant advantage of the inventive processor system is that each L0 cache is associated with (e.g., is “tightly coupled” to) only one execution unit so that L0 cache design is greatly simplified. For example, because each L0 cache is accessed by only one execution unit, arbitration for L0 cache access is not required (e.g., cache arbitration circuitry within each L0 cache is unnecessary), and cache access occurs at the fastest possible speeds (e.g., is not limited by arbitration delays). Further, because memory locations are not shared between L0 caches, L0 cache resources are maximized (e.g., all L0 cached data is non-duplicative data). The addresses assigned to the L0 caches may be assigned without regard for the current thread or task so that assigning and managing task algorithms are not required. In addition, the small size of the L0 caches allows each L0 cache to be located near its associated execution unit (e.g., reducing wiring lengths and thus signal propagation delays).
Other objects, features and advantages of the present invention will become more fully apparent from the following detailed description of the preferred embodiments, the appended claims and the accompanying drawings.

