Electrical computers and digital processing systems: memory – Storage accessing and control – Hierarchical memories
Reexamination Certificate
2000-12-15
2002-11-12
Lane, Jack A. (Department: 2186)
C711S004000, C711S005000
active
06480938
ABSTRACT:
FIELD OF THE INVENTION
The present invention relates generally to the field of cache structures and, more particularly, to cache structures for use with variable length instructions that may cross cache line boundaries.
BACKGROUND OF THE INVENTION
Variable length instructions occur not only in CISC processors, but also in very long instruction word (VLIW) architectures with NOP (non-operational instruction) compression. In particular, VLIW bundles of instructions must explicitly schedule NOP operations in unused issue slots. In order to reduce code size and better utilize the instruction cache, these NOPs are compressed out. This operation results in variable length instruction bundles. With variable length bundles, some of the bundles may cross cache line boundaries. The processing necessary to handle bundles that cross cache line boundaries adversely affects die area and cycle time.
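The effect described above can be illustrated with a minimal sketch. It assumes a four-issue machine, marks empty issue slots with a hypothetical NOP value, and packs the compressed bundles back to back at syllable granularity. The names compress and crossing_bundles are illustrative only, and the encoding a real machine needs to mark where each compressed bundle ends is omitted.

```python
NOP = None  # hypothetical marker for an unused issue slot

def compress(bundle):
    """Drop NOP slots, keeping only the real operations (syllables)."""
    return [op for op in bundle if op is not NOP]

def crossing_bundles(bundles, line_size):
    """Return the indices of compressed bundles that span two cache lines.

    Bundles are packed back to back; line_size is in syllables.
    """
    crossing = []
    addr = 0  # syllable address of the next bundle
    for i, bundle in enumerate(bundles):
        length = len(compress(bundle))
        if length > 0:
            first_line = addr // line_size
            last_line = (addr + length - 1) // line_size
            if first_line != last_line:
                crossing.append(i)
        addr += length
    return crossing

bundles = [
    ["add", NOP, "ld", NOP],
    ["mul", "sub", "st", NOP],
    ["br", NOP, NOP, NOP],
    ["add", "add", "ld", "st"],
]
print(crossing_bundles(bundles, line_size=8))
```

With eight-syllable lines, the compressed bundle lengths are 2, 3, 1 and 4, so the last bundle occupies syllables 6 through 9 and is the only one that straddles a line boundary.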
A straightforward approach to processing instruction bundles that cross cache line boundaries is to cache the bundles in an uncompressed form. This caching can be implemented by uncompressing the bundles as they are loaded into the cache on a miss. See Lowney, P., Freudenberger, S., Karzes, T., Lichtenstein, W., Nix, R., O'Donell, J., Ruttenberg, J., “The Multiflow Trace Scheduling Compiler”, Journal of Supercomputing, January 1993, pages 51-142; Wolfe, A. and Chanin, A., “Executing Compressed Programs on An Embedded RISC Architecture”, International Symposium on Microarchitecture, December 1992, pages 81-91. Because the uncompressed bundles will be fixed in size, the cache line size can be chosen such that the bundles will never straddle cache line boundaries. This, however, results in reduced cache performance due to NOP instructions occupying cache slots. A second issue with this approach is the mapping of the uncompressed bundles into the cache. Because uncompressed and compressed bundles have different sizes, a mechanism is needed to translate between the PC (program counter) and the main memory addresses.
A second approach to the problem of bundles crossing cache line boundaries is to restrict the bundles to a limited set of sizes. See Beck, G., Yen, D., Anderson, T., “The Cydra 5 Mini Supercomputer: Architecture and Implementation,” Journal of Supercomputing, January 1993, pages 143-180; Rau, B., Yen, D., Yen, W., and Towle, R., “The Cydra 5 Departmental Supercomputer: Design Philosophies, Decisions and Trade-offs”, Computer, January 1989, pages 12-35. Bundles that fall between the allowed sizes are NOP padded to the next size up. Bundles within a cache line are restricted to be either all the same size or a limited combination of sizes. The advantage of this approach is that it is relatively simple. The disadvantage of the approach is that it limits the amount of NOP compression, both in main memory and in the cache. This results in both a larger code size and a reduction of cache performance.
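A small sketch of the padding cost of this size-restricted scheme follows. The allowed sizes (1, 2 and 4 syllables) are an assumption chosen for illustration and are not taken from the cited machines; padded_size and padding_overhead are hypothetical names.

```python
ALLOWED_SIZES = (1, 2, 4)  # assumed set of permitted bundle sizes, in syllables

def padded_size(length):
    """Round a compressed bundle length up to the next allowed size."""
    for size in ALLOWED_SIZES:
        if size >= length:
            return size
    raise ValueError("bundle longer than the largest allowed size")

def padding_overhead(lengths):
    """Count the NOP syllables reintroduced by rounding each bundle up."""
    return sum(padded_size(n) - n for n in lengths)

print(padding_overhead([2, 3, 1, 4]))
```

For bundle lengths 2, 3, 1 and 4, only the three-syllable bundle is padded (to four), so one NOP syllable is reintroduced. The more bundle lengths fall between allowed sizes, the more compression is lost.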
Another common approach to the problem is to not allow variable length bundles to cross cache line boundaries. Bundles that would cross a cache line boundary are moved to the next line with NOP padding (see Conte, T., Banerjia, S., Larin, S., Menezes, K., and Sathaye, S., “Instruction Fetch Mechanisms for VLIW Architecture with Compressed Encodings,” Symposium on Microarchitecture, December 1996, pages 201-211). This design results in a reduction in code compression in both memory and cache.
A fourth approach to the crossing of cache line boundaries is to use a banked cache structure. Typically, banked caches are implemented by splitting the cache into two physical pieces, one for the even cache lines and the other for the odd cache lines. See Banerjia, S., Menezes, K., and Conte, T., “Next P.C. computation for a Banked Instruction Cache for a VLIW Architecture with a Compressed Encoding”, Technical Report, Department of Electrical and Computer Engineering, North Carolina State University, Raleigh, N.C. 27695-7911, June 1996; and the Conte et al. reference noted earlier. By using this approach, adjacent cache lines can always be accessed simultaneously. A banked instruction cache is implemented in the AMD K5 processor. See Christie, D., “Developing the AMD-K5 Architecture”, IEEE Micro, April 1996, pages 16-26. The disadvantage of this banked approach is that data is distributed between the two banks on a cache line basis. Since cache lines are long and bundles can start at any location within either bank's line, bus routing and multiplexing are costly. For example, a banked cache structure with 32 syllable lines has 64 possible bundle starting positions (32 in each bank). A four issue, 32-bit per syllable machine with this line length would require a 128-bit 64-to-1 multiplexer to index the desired bundle. This is costly both in terms of die area and cycle time. In addition, the implementation typically places an incrementor on the critical timing path between the PC and the cache. Because of these points, banked caches have been used only sparingly.
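The multiplexer figures in the example above follow from simple arithmetic, reproduced here as a short sketch. The function name mux_dimensions and its parameters are illustrative; the numbers passed in are the ones given in the text.

```python
def mux_dimensions(syllables_per_line, banks, issue_width, bits_per_syllable):
    """Return (number of mux inputs, mux data width in bits).

    A bundle may start at any syllable of the line in either bank, so the
    mux needs one input per starting position; its data width is one full
    bundle.
    """
    inputs = syllables_per_line * banks
    width_bits = issue_width * bits_per_syllable
    return inputs, width_bits

print(mux_dimensions(syllables_per_line=32, banks=2,
                     issue_width=4, bits_per_syllable=32))
```

This yields 64 inputs and a 128-bit data path, matching the 128-bit 64-to-1 multiplexer described above.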
A proposal by STMicroelectronics is to use a single bank cache, latching the current sub-line until all syllables within the sub-line have been used. This frees the instruction cache to fetch the next sub-line if needed to complete the bundle. Intel uses a similar approach in their Pentium Pro Processor. See Gwennap, L., “Intel's P6 Uses Decoupled Superscalar Design”, Microprocessor Report, February 1995, pages 9-15. This single bank cache approach works well for the execution of sequential code segments, effectively mimicking a two bank cache. However, branches to bundles that straddle sub-lines result in a stall because two sub-lines are needed. Because branches occur frequently and the probability that the target will straddle a line boundary is high, the degradation in performance is significant.
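The stall behavior of the latched single bank approach can be sketched with a simple cycle count. This is an illustrative model only: it assumes the cache delivers one sub-line per cycle, that a bundle is never longer than a sub-line, and that at most one sub-line is held in the latch. The name fetch_cycles is hypothetical.

```python
def fetch_cycles(start, length, subline_size, latched=None):
    """Cycles needed to gather every sub-line a bundle touches.

    start and length are in syllables; latched is the index of the sub-line
    currently held in the latch, or None after a branch.
    """
    first = start // subline_size
    last = (start + length - 1) // subline_size
    needed = set(range(first, last + 1)) - {latched}
    return max(1, len(needed))

# Sequential code: sub-line 0 is already latched, so only sub-line 1 is read.
print(fetch_cycles(start=6, length=4, subline_size=8, latched=0))
# Branch target straddling two sub-lines: both must be read.
print(fetch_cycles(start=6, length=4, subline_size=8))
```

Under these assumptions the sequential case takes one cycle, while the branch to a straddling bundle takes two, which is the one-cycle stall described above.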
SUMMARY OF THE INVENTION
Briefly, in one aspect the present invention comprises a cache structure, organized in terms of cache lines, for use with variable length bundles of instructions (syllables), including: a first cache bank that is organized in columns and rows; a second cache bank that is organized in columns and rows; logic for defining the cache line into a sequence of equal sized segments, and mapping alternate segments in the sequence of segments to the columns in the cache banks such that the first bank holds even segments and the second bank holds odd segments; logic for storing bundles across at most a first column in the first cache bank and a sequentially adjacent column in the second cache bank; and logic for accessing bundles stored in the first and second cache banks.
In a further aspect of the present invention, each of the segments is at least the size of a maximum issue width for the cache structure.
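The segment-to-bank mapping described above can be sketched as follows. The sketch considers the segments of a single cache line and assumes that segment k is placed in bank k mod 2 at column k div 2. That column numbering is an assumption made for illustration, as are the names locate and segments_touched.

```python
def locate(segment_index):
    """Map a segment index to (bank, column): even -> bank 0, odd -> bank 1."""
    return segment_index % 2, segment_index // 2

def segments_touched(start, length, segment_size):
    """List the segment indices a bundle occupies (units are syllables)."""
    first = start // segment_size
    last = (start + length - 1) // segment_size
    return list(range(first, last + 1))

# A four-syllable bundle starting at syllable 6, with four-syllable segments,
# occupies segments 1 and 2, which live in different banks.
print([locate(s) for s in segments_touched(6, 4, segment_size=4)])
```

Because each segment is at least as large as the maximum issue width, a bundle can touch at most two consecutive segments, and consecutive segments always fall in opposite banks. Both pieces of the bundle can therefore be read in the same access.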
In a yet further aspect of the present invention, the storing logic includes logic for storing the bundles across line boundaries.
In a further aspect of the present invention, the accessing logic comprises: a first select logic for selecting a first column in the first cache bank based on first information from a program counter; and a second select logic for selecting a sequentially adjacent column, relative to the column selected in the first cache bank, in the second cache bank based on the first information from the program counter.
In yet a further aspect of the present invention, one of the first and second select logics includes an incrementor to selectively increment the first information from the program counter.
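One way the two select logics and the incrementor could behave is sketched below. It assumes that the "first information" from the program counter is the index of the segment in which the bundle starts, and that segment k sits in bank k mod 2 at column k div 2. Both are assumptions for illustration, and wrap-around into the next cache line is not modeled.

```python
def select_columns(segment_index):
    """Return (column for bank 0, column for bank 1) for a bundle that
    starts in the given segment and may continue into the next one."""
    base = segment_index >> 1          # column of the starting segment
    starts_in_odd = segment_index & 1  # 1 if the bundle starts in bank 1
    # The following segment is even when the start is odd, so only the
    # bank 0 select needs the incremented value, and only in that case.
    col_bank0 = base + starts_in_odd
    col_bank1 = base
    return col_bank0, col_bank1

print(select_columns(4))  # starts in even segment 4; continues in segment 5
print(select_columns(5))  # starts in odd segment 5; continues in segment 6
```

In this sketch only the bank 0 select needs an incrementor, and it is active only when the bundle starts in an odd segment.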
In yet a further aspect of the present invention, the accessing logic comprises: third select logic for receiving an input of second information from the program counter; and a bundle extraction multiplexer receiving a segment from each of the first cache bank and the second cache bank and outputting a composite segment formed from one or both of the received segments in accordance with a selector output from the third select logic.
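The behavior of the bundle extraction multiplexer can be sketched as below. The sketch assumes that the "second information" from the program counter supplies two things: which bank holds the start of the bundle, and the syllable offset of the bundle within that segment. This selector encoding and the name extract_bundle are assumptions for illustration.

```python
def extract_bundle(seg_bank0, seg_bank1, starts_in_bank0, offset):
    """Form a composite segment from the two segments read from the banks.

    The output begins at `offset` in the segment holding the bundle start
    and is completed with the leading syllables of the other segment.
    """
    if starts_in_bank0:
        first, second = seg_bank0, seg_bank1
    else:
        first, second = seg_bank1, seg_bank0
    return first[offset:] + second[:offset]

even = ["a", "b", "c", "d"]  # segment read from the first (even) bank
odd = ["e", "f", "g", "h"]   # segment read from the second (odd) bank
print(extract_bundle(even, odd, starts_in_bank0=True, offset=2))
print(extract_bundle(even, odd, starts_in_bank0=False, offset=1))
```

With an offset of zero the composite segment comes entirely from one bank; otherwise it is formed from both, and the output is always one segment wide.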
In yet a further aspect of the present invention, an output from a decoder for one of the first and second cache bank is rotated by o
Hewlett-Packard Company