Method and apparatus for efficient cache mapping of...

Electrical computers and digital processing systems: memory – Addressing combined with specific memory configuration or... – Addressing cache memories

Reexamination Certificate

Details

C711S202000, C711S118000, C711S125000, C711S137000, C711S170000, C710S068000, C712S205000, C712S024000

Type: Reexamination Certificate

Status: active

Patent number: 06581131

ABSTRACT:

FIELD OF INVENTION
The present invention is directed to the efficient utilization of caches in computer architecture. Specifically, the invention is directed to a method and apparatus for efficient cache mapping of compressed cache lines containing Very Long Instruction Word (VLIW) instructions.
BACKGROUND OF THE INVENTION
Very Long Instruction Word (VLIW) architectural instructions comprise multiple operations packed into a single very long instruction word. A VLIW processor relies on an optimizing compiler to find useful work to fill the operation slots in the VLIW. To do so, the compiler uses techniques such as loop unrolling, inlining, and code motion to maximize performance. This comes at the cost of increased code size. In addition, the compiler may not be able to fill all operation slots. Thus, no-ops (no operation) are used as fillers, increasing code size further. This generally results in VLIW code size being larger than that of other architectures. To combat this, VLIW code may be stored in compressed form in a cache line and decompressed as the cache line is loaded from memory.
Because compressed instructions vary in size, the instruction address (i.e., program counter) is not incremented by a set value. Further, the cache location is indexed by the compressed line address; that is, the lower address bits are used to map lines into the cache. This leads to sequential lines either being mapped to the same cache location or being distributed to non-sequential entries in the cache. Both effects increase conflict misses and reduce cache utilization, degrading overall cache performance. This problem is further explained with reference to FIGS. 1 and 2.
FIG. 1 illustrates an example of inefficient mapping of compressed cache lines into an instruction cache. FIG. 2 presents a simplified view of a typical instruction cache.
FIG. 1 shows a portion of a main memory 110 in which are stored compressed instruction cache lines 120 of different lengths at memory locations (addresses) 115. The figure also shows an instruction decompression unit 125 for decompressing the compressed lines retrieved from main memory 110. The instruction cache 130 is shown having stored therein the decompressed lines 140 with corresponding instruction tag entries 135. As can be seen, sequential lines are not in order and are distributed throughout the cache. Also shown are the program counter 145 and a comparator 150.
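As a concrete illustration of this scattering, the sketch below indexes a direct-mapped cache with the lower bits of each compressed line's byte address; the line sizes, set count, and offset width are hypothetical and not taken from the patent.

```c
#include <stdio.h>
#include <stdint.h>

#define NUM_SETS     16       /* hypothetical: 16-entry direct-mapped cache          */
#define OFFSET_BITS  5        /* hypothetical: 32-byte lines, so 5 byte-offset bits  */

/* The set index is taken from the lower bits of the *compressed* line address. */
static unsigned cache_index(uint32_t compressed_addr)
{
    return (compressed_addr >> OFFSET_BITS) & (NUM_SETS - 1);
}

int main(void)
{
    /* Hypothetical compressed sizes, in bytes, of six sequential cache lines. */
    uint32_t sizes[] = { 12, 40, 8, 20, 36, 16 };
    uint32_t addr = 0;

    for (int line = 0; line < 6; line++) {
        printf("line %d: compressed addr 0x%04x -> set %u\n",
               line, (unsigned)addr, cache_index(addr));
        addr += sizes[line];  /* the next compressed line starts where this one ends */
    }
    return 0;
}
```

With these assumed sizes, the six sequential lines land in sets 0, 0, 1, 1, 2, and 3: lines that should be consecutive collide in a few sets while most of the cache stays unused, which is the combination of conflict misses and poor utilization described above.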
FIG. 2 shows the components of a typical instruction cache implementation and corresponds to the cache mapping shown in FIG. 1. It consists of an Instruction Cache Data RAM 210 in which the decompressed cache lines are stored, an Instruction Cache Tag RAM 215 in which the instruction tag entries are stored, a Program Counter (PC) register 220, PC increment circuitry 225, Branch logic 230, and Cache Control logic (not shown). Also shown in the figure are a comparator 240 and a memory controller 235 for controlling main memory accesses.
In the typical cache implementation of FIGS. 1 and 2, the lower bits l of the Program Counter (PC) select an entry in the Instruction Cache Tag and Instruction Cache Data RAMs. The upper bits u of the PC are compared, in the comparator, with the value retrieved from the Cache Tag. If they match, the access is deemed a “hit.” On a hit, the Instruction Cache Line retrieved from the Cache Data RAM is passed to the processor pipeline and the PC is incremented by a set amount n. If they do not match, the access is deemed a “miss.” On a miss, the PC is supplied as the memory address to the Memory Controller 235, which retrieves the desired cache line from main memory at the supplied address and loads it and the upper bits of the PC into the selected cache data and cache tag entries, respectively. The access then proceeds as a “hit.”
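A minimal sketch of that hit/miss sequence follows, assuming a hypothetical direct-mapped cache with 2^INDEX_BITS entries, a fixed increment of one line per access, and a stub standing in for the memory controller and decompression unit; none of the names come from the patent.

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define INDEX_BITS  6                       /* hypothetical: lower bits l of the PC  */
#define NUM_SETS    (1u << INDEX_BITS)
#define LINE_WORDS  8                       /* hypothetical VLIW line width in words */
#define PC_INCR     1                       /* set increment n, in lines             */

typedef struct {
    bool     valid;
    uint32_t tag;                           /* upper bits u of the PC                */
    uint32_t data[LINE_WORDS];              /* decompressed VLIW cache line          */
} cache_entry_t;

static cache_entry_t icache[NUM_SETS];

/* Stand-in for the memory controller plus the instruction decompression unit. */
static void fetch_and_decompress(uint32_t pc, uint32_t line_out[LINE_WORDS])
{
    memset(line_out, 0, LINE_WORDS * sizeof(uint32_t));
    line_out[0] = pc;                       /* placeholder line contents             */
}

/* Look up the line addressed by *pc, filling it on a miss, then advance the PC. */
const uint32_t *icache_access(uint32_t *pc)
{
    uint32_t index = *pc & (NUM_SETS - 1);  /* lower bits select the entry           */
    uint32_t tag   = *pc >> INDEX_BITS;     /* upper bits are compared with the tag  */
    cache_entry_t *e = &icache[index];

    if (!e->valid || e->tag != tag) {       /* miss: load and tag the selected entry */
        fetch_and_decompress(*pc, e->data);
        e->tag   = tag;
        e->valid = true;
    }
    *pc += PC_INCR;                         /* increment by the set amount n         */
    return e->data;
}
```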
A change in the PC sequencing can be performed through a branch instruction. A branch instruction causes the PC to be updated with a Branch Target Address, supplied in the branch instruction as an absolute or PC relative (PC plus Offset) address.
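In code form, the two update modes amount to the small helper below (the field names are assumptions for illustration):

```c
#include <stdint.h>
#include <stdbool.h>

/* Next-PC selection for a taken branch: an absolute target, or PC plus offset. */
static uint32_t branch_target(uint32_t pc, uint32_t absolute_target,
                              int32_t offset, bool is_absolute)
{
    return is_absolute ? absolute_target          /* absolute Branch Target Address */
                       : pc + (uint32_t)offset;   /* PC-relative (PC plus Offset)   */
}
```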
Colwell et al., in U.S. Pat. Nos. 5,057,837 and 5,179,680, addressed the issue of no-op compression for VLIW instructions. The approach packs the useful operations of an instruction into a variable length compressed instruction. Associated with this compressed instruction is a mask word. Each bit of the mask word corresponds to an operation slot in the VLIW instruction. A mask word bit set to one specifies that a useful operation held in the compressed instruction is mapped to the operation slot. A zero specifies that a no-op occupies the slot. The VLIW instruction is reconstructed during a cache miss by expanding the compressed instruction, inserting no-ops as specified by the mask word. The reconstructed VLIW instruction and the upper bits of the program counter (PC) are loaded into the cache data and tag, respectively, at the line indexed by the lower bits of the PC. The PC is equal to the address of the compressed VLIW instruction in memory. Because the compressed instructions vary in size, the PC is not incremented by a set amount for each instruction. For this reason, the implementation also stores the next PC, computed from the mask word and current PC, into the cache line.
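The expansion step can be pictured with the sketch below; the slot count and no-op encoding are hypothetical, and this is only an illustration of a mask-word scheme of this kind, not the implementation from the Colwell patents.

```c
#include <stdint.h>

#define VLIW_SLOTS  8                 /* hypothetical number of operation slots */
#define NOP         0x00000000u       /* hypothetical no-op encoding            */

/*
 * Expand a compressed instruction into a full VLIW word.
 * Mask bit i set to 1 means the next useful operation from the compressed
 * stream fills slot i; a 0 bit means slot i receives a no-op.
 */
void expand_vliw(uint32_t mask, const uint32_t *compressed,
                 uint32_t vliw_out[VLIW_SLOTS])
{
    const uint32_t *op = compressed;

    for (int slot = 0; slot < VLIW_SLOTS; slot++) {
        if (mask & (1u << slot))
            vliw_out[slot] = *op++;   /* useful operation from compressed form */
        else
            vliw_out[slot] = NOP;     /* filler no-op                          */
    }
}
```

The population count of the mask also gives the compressed instruction's length in operations, which is what allows the next PC to be derived from the mask word and the current PC.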
One disadvantage of this method is that the cache location is indexed by the compressed instruction address. As discussed earlier and shown in FIG. 1, this leads to a reduction in cache performance. One solution is to either pad or not compress critical instruction sequences; in other words, give up some compression to improve cache performance.
Another proposal is to use a virtual memory style approach in which the PC is incremented by a set value, indexing the cache with its lower bits. On a cache miss, the PC indexes a translation table to obtain the address of the compressed line in memory. The compressed line is then accessed, decompressed, and loaded into the appropriate location in the cache. This achieves efficient mapping of the decompressed lines into the cache at the cost of an additional translation table access.
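A sketch of that miss path, under assumed structures (a per-line translation table indexed by the line-number portion of the PC, and stubs for the memory and decompression steps), might look like this:

```c
#include <stdint.h>
#include <string.h>

#define LINE_WORDS  8        /* hypothetical decompressed line size, in words */
#define NUM_LINES   1024     /* hypothetical program size, in cache lines     */

/* One entry per decompressed line: where its compressed form lives in memory. */
static uint32_t translation_table[NUM_LINES];

/* Stand-ins for the burst memory read and the instruction decompression unit. */
static void burst_read(uint32_t addr, uint8_t *buf, size_t bytes)
{
    (void)addr;
    memset(buf, 0, bytes);
}

static void decompress_line(const uint8_t *src, uint32_t out[LINE_WORDS])
{
    memcpy(out, src, LINE_WORDS * sizeof(uint32_t));
}

/*
 * Miss handling: the fixed-increment PC yields a simple line number, which
 * first indexes the translation table (the extra access), and only then is
 * the compressed line fetched and decompressed into the cache.
 */
void handle_miss(uint32_t line_number, uint32_t line_out[LINE_WORDS])
{
    uint32_t compressed_addr = translation_table[line_number];   /* access #1 */

    uint8_t compressed[LINE_WORDS * sizeof(uint32_t)];
    burst_read(compressed_addr, compressed, sizeof compressed);  /* access #2 */
    decompress_line(compressed, line_out);
}
```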
In today's burst memory systems, it is advantageous to minimize multiple random accesses in favor of multi-word bursts. A random access has a latency of 5 to 15 times that of a sequential burst access. A drawback of the implementation presented above is that it requires an additional translation table access, which cannot be combined with other accesses. This could nearly double the miss penalty in certain implementations.
One proposal to avoid the added cost of the table access is to use a Translation Lookaside Buffer (TLB). A TLB, in essence a cache of the translation table, works with the general assumption that the same translation (TLB entry) is used for many page accesses. In the case of compressed cache lines, a translation is associated with each cache line. Thus, a much larger TLB than usual is needed to achieve an effective TLB hit rate. Note that each cache line must be compressed independently of other cache lines. This is needed because, even if the cache line does not contain a branch target, the line might be replaced in the cache and later reloaded. Cache lines could be compressed together in blocks; however, this would increase the miss penalty because an instruction could only be retrieved by decompressing from the beginning of the block.
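For illustration, a direct-mapped TLB over per-line translations might look like the sketch below (the entry count and field names are assumptions); the point above is that, with one translation per cache line rather than per page, the entry count must be unusually large before the hit rate becomes useful.

```c
#include <stdint.h>
#include <stdbool.h>

#define TLB_ENTRIES  256            /* hypothetical; per-line translation needs many */

typedef struct {
    bool     valid;
    uint32_t line_number;           /* one translation per compressed cache line */
    uint32_t compressed_addr;       /* where that line's compressed form lives   */
} tlb_entry_t;

static tlb_entry_t tlb[TLB_ENTRIES];

/* Returns true on a TLB hit and fills *compressed_addr; a miss must fall back
 * to the in-memory translation table. */
bool tlb_lookup(uint32_t line_number, uint32_t *compressed_addr)
{
    tlb_entry_t *e = &tlb[line_number % TLB_ENTRIES];   /* direct-mapped lookup */
    if (e->valid && e->line_number == line_number) {
        *compressed_addr = e->compressed_addr;
        return true;
    }
    return false;
}
```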
An alternative is to allocate several words to each translation table entry. If the compressed cache line fits within the entry, only the translation table is accessed. Otherwise, a pointer to the remaining words of the compressed line is stored in one of the entry words. A critical design choice of this approach is the entry size. To achieve the best compression, the entry size should be as small as possible. To utilize the burst capability of the memory system, the entry size should be sufficiently large (at least 4 memory words, which is 32 bytes in a 64-bit memory system). To minimize average miss penalty, the majority of the instructions executed should fit in the entry.
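One way to picture such an entry (sizes, field names, and the overflow-pointer convention are assumptions for illustration) is a fixed-size record that holds the compressed line inline when it fits, and otherwise devotes one word to a pointer to the remaining compressed words:

```c
#include <stdint.h>
#include <stdbool.h>

#define ENTRY_WORDS  4      /* hypothetical: 4 x 64-bit words = 32 bytes, one burst */

/*
 * Fixed-size translation-table entry. If the compressed line fits in
 * ENTRY_WORDS, it is stored inline and a single table access suffices;
 * otherwise one of the words holds a pointer to the rest of the line.
 */
typedef struct {
    bool     fits_inline;             /* compressed line fits entirely in words[]   */
    uint64_t words[ENTRY_WORDS];      /* inline data, or data plus overflow pointer */
} tt_entry_t;

/* Assumed convention: when the line does not fit, the last word is the pointer. */
static inline uint64_t overflow_pointer(const tt_entry_t *e)
{
    return e->words[ENTRY_WORDS - 1];
}
```

The tension described above is then visible directly: a larger ENTRY_WORDS wastes compression on short lines, while a smaller one forces a second memory access for the lines that spill over.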
