Electrical computers and digital processing systems: memory – Storage accessing and control – Control technique
Reexamination Certificate
1998-01-06
2001-11-20
Yoo, Do Hyun (Department: 2185)
C711S167000, C711S118000, C710S035000, C710S034000, C710S020000, C710S033000
active
06321310
ABSTRACT:
This invention relates to computer systems, and in particular, but not exclusively, to such systems for processing media data.
An optimal computer architecture is one which meets its performance requirements whilst achieving minimum cost. In a media-intensive appliance system, at present the main hardware cost contributor is memory. The memory must have enough capacity to hold the media data and provide enough access bandwidth for the computation throughput requirements to be met. Such an appliance system needs to maximise the data throughput, as opposed to a normal processor which usually has to maximise the instruction throughput. The present invention is concerned in particular, but not exclusively, with extracting high performance from low cost memory, given the constraints of processing media-intensive algorithms.
The present invention relates in particular to a computer system of the type comprising: a processing system for processing data; a memory (provided for example by dynamic RAM (“DRAM”)) for storing data processed by, or to be processed by, the processing system; a memory access controller for controlling access to the memory; and a data buffer (provided for example by static RAM (“SRAM”)) for buffering data to be written to or read from the memory.
At present, the cheapest form of symmetric read-write memory is DRAM. (By symmetric, it is meant that read and write accesses take identical times, unlike reads and writes with Flash memory.) DRAM is at present used extensively in personal computers as the main memory, with faster (and more expensive) technologies such as SRAM being used for data buffers or caches closer to the processor. In a low cost system, there is a need to use the lowest cost memory that permits the performance (and power) goals to be met. In the making of the present invention, an analysis has been performed of the cheapest DRAM technologies in order to understand the maximum data bandwidths which could be obtained, and it is clear that existing systems are not utilising the available bandwidth. The present invention is concerned with increasing the use of the available bandwidth and therefore increasing the overall efficiency of the memory in such a computer system and in similar systems.
A typical processor can access SRAM cache in 10 ns. However, an access to main DRAM memory may take 200 ns in an embedded system, where memory cost needs to be minimised, which is a twentyfold increase. Thus, in order to ensure high throughput, it is necessary to place as much data as possible in the local cache memory block before it is needed. Then, the processor only sees the latency of access to the fast, local cache memory, rather than the longer delay to main memory.
“Latency” is the time taken to fetch a datum from memory. It is of paramount concern in systems which are “compute-bound”, i.e. where the performance of the system is dictated by the processor. The large factor between local and main memory speed may cause the rate of processing to be determined by the performance of the memory system. This case is “bandwidth-bound” and is ultimately limited by the bandwidth of the memory system. If the processor runs fast enough compared to the memory, it may generate requests at a faster rate than the memory can satisfy. Many systems today are crossing from being compute-bound to being bandwidth-bound.
Using faster memory is one technique for alleviating the performance problem. However, this adds cost. An alternative approach is to recognise that existing memory chips are used inefficiently and to evolve new methods to access this memory more efficiently.
A feature of conventional DRAM construction is that it enables access in “bursts”. A DRAM comprises an array of memory locations in a square matrix. To access an element in the array, a row must first be selected (or ‘opened’), followed by selection of the appropriate column. However, once a row has been selected, successive accesses to columns in that row may be performed by just providing the column address. The concept of opening a row and performing a sequence of accesses local to that row is called a “burst”.
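The row and column behaviour described above can be illustrated with a toy timing model. This is only a sketch: the function name, the row size and the nanosecond figures are hypothetical values chosen for illustration and are not taken from the specification.

```python
def dram_access_time(addresses, columns_per_row, row_open_ns, column_ns):
    """Total time (ns) to access `addresses` in order on a toy DRAM model.

    Assumption: opening a row costs `row_open_ns`, and every access to a
    column of the currently open row costs `column_ns`.
    """
    open_row = None
    total_ns = 0
    for address in addresses:
        row = address // columns_per_row
        if row != open_row:
            # A different row must be selected ("opened") first.
            total_ns += row_open_ns
            open_row = row
        # Column access within the open row.
        total_ns += column_ns
    return total_ns


# Eight accesses local to one row (a burst) versus eight accesses that
# each land in a different row.
burst = dram_access_time(range(8), 1024, 60, 10)
scattered = dram_access_time([i * 1024 for i in range(8)], 1024, 60, 10)
```

With these assumed timings the burst pays the row-opening cost once, whereas the scattered pattern pays it on every access.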
The term “burst efficiency” used in this specification is a measure of the ratio of (a) the minimum access time to the DRAM to (b) the average access time to the DRAM. A DRAM access involves one long access and (n−1) shorter accesses in order to burst n data items. Thus, the longer the burst, the lower the average access time (and so, the higher the bandwidth). Typically, a cache-based system (for reasons of cache architecture and bus width) will use bursts of four accesses. This corresponds to about 25 to 40% burst efficiency. For a burst length of 16 to 32 accesses, the efficiency is about 80%, i.e. about double.
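The definition of burst efficiency can be worked through numerically. The sketch below assumes that the minimum access time equals the shorter, in-row access time, and it uses hypothetical timings of 70 ns for the first access and 10 ns for each subsequent access; neither figure comes from the specification.

```python
def burst_efficiency(n, first_access_ns, next_access_ns):
    """Ratio of minimum access time to average access time for a burst of n.

    Assumption: the burst consists of one long access followed by (n - 1)
    shorter accesses, and the minimum access time is the shorter one.
    """
    average_ns = (first_access_ns + (n - 1) * next_access_ns) / n
    return next_access_ns / average_ns


# Hypothetical timings: 70 ns first access, 10 ns for each later access.
for n in (4, 16, 32):
    print(n, round(burst_efficiency(n, 70, 10), 2))
```

Under these assumed timings a burst of four gives an efficiency of 0.4, and lengthening the burst to 16 or 32 moves the figure towards the 80% region quoted above.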
The term “saturation efficiency” used in this specification is a measure of how frequently there is traffic on the DRAM bus. In a processor-bound system, the bus will idle until there is a cache miss and then there will be a 4-access burst to fetch a new cache line. In this case, latency is very important. Thus, there is low saturation efficiency because the bus is being used rarely. In a test on one embedded system, a saturation efficiency of 20% was measured. Thus, there is an opportunity of obtaining up to a fivefold increase in performance from the bus.
Combining the possible increases in burst efficiency and saturation efficiency, it may be possible to obtain about a tenfold improvement in throughput for the same memory currently used.
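The tenfold estimate follows from multiplying the two gains. The short sketch below makes that arithmetic explicit; it assumes that the two effects combine multiplicatively and that the bus could in principle be kept fully busy, and the function name is hypothetical.

```python
def throughput_gain(burst_before, burst_after,
                    saturation_before, saturation_after=1.0):
    """Combined throughput gain from improving both efficiencies.

    Assumption: burst efficiency and saturation efficiency multiply
    independently, and saturation can at best reach 1.0 (bus always busy).
    """
    burst_gain = burst_after / burst_before
    saturation_gain = saturation_after / saturation_before
    return burst_gain * saturation_gain


# Figures from the text: burst efficiency roughly doubling (40% to 80%)
# and saturation efficiency rising from the measured 20% to the maximum.
gain = throughput_gain(0.4, 0.8, 0.2)
```

Doubling the burst efficiency and raising saturation fivefold gives a combined factor of about ten, matching the estimate above.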
A first aspect of the present invention is characterised by: means for issuing burst instructions to the memory access controller, the memory access controller being responsive to such a burst instruction to transfer a plurality of data words between the memory and the data buffer in a single memory transaction; and means for queueing such burst instructions so that such a burst instruction can be made available for execution by the memory access controller immediately after a preceding burst instruction has been executed.
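The queueing arrangement of the first aspect can be sketched in software. This is an illustrative model only, not the claimed hardware: the class names, field names and the use of Python lists for the memory and the data buffer are all assumptions.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class BurstInstruction:
    """Hypothetical encoding of one burst instruction."""
    to_buffer: bool       # True: memory -> buffer; False: buffer -> memory
    memory_address: int   # first word in main memory
    buffer_address: int   # first word in the data buffer
    length: int           # number of words moved in this transaction


class MemoryAccessController:
    """Toy controller that executes queued burst instructions in order."""

    def __init__(self, memory, buffer):
        self.memory = memory
        self.buffer = buffer
        self.queue = deque()
        self.transactions = 0

    def issue(self, instruction):
        # Queueing lets the next burst be ready as soon as one finishes.
        self.queue.append(instruction)

    def run(self):
        while self.queue:
            inst = self.queue.popleft()
            m, b, n = inst.memory_address, inst.buffer_address, inst.length
            if inst.to_buffer:
                self.buffer[b:b + n] = self.memory[m:m + n]
            else:
                self.memory[m:m + n] = self.buffer[b:b + n]
            # Each burst instruction is one memory transaction.
            self.transactions += 1


memory = list(range(32))
buffer = [0] * 8
controller = MemoryAccessController(memory, buffer)
controller.issue(BurstInstruction(True, 8, 0, 4))
controller.issue(BurstInstruction(True, 20, 4, 4))
controller.run()
```

In this model the processor side only calls `issue`, so a second burst is already waiting in the queue when the first completes.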
A second aspect of the invention is characterised by: means for issuing burst instructions to the memory access controller, each such burst instruction including or being associated with a parameter defining a spacing between locations in the memory to be accessed in response to that burst instruction, and the memory access controller being responsive to such a burst instruction to transfer a plurality of data elements between the memory, at locations spaced in accordance with the spacing parameter, and the data buffer in a single memory transaction.
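The spacing parameter of the second aspect behaves like a stride. The sketch below shows the idea with plain Python lists standing in for the memory and the data buffer; the function names and the 4 x 4 example array are hypothetical.

```python
def strided_burst_load(memory, start, stride, count):
    """Gather `count` elements spaced `stride` apart into a buffer."""
    return [memory[start + i * stride] for i in range(count)]


def strided_burst_store(memory, start, stride, values):
    """Scatter buffered values back to locations spaced `stride` apart."""
    for i, value in enumerate(values):
        memory[start + i * stride] = value


# A 4 x 4 array stored row by row: element (r, c) is at address 4 * r + c.
memory = list(range(16))
# Fetching column 1 is a single burst with a spacing of 4.
column = strided_burst_load(memory, start=1, stride=4, count=4)
```

A spacing of one reduces this to an ordinary contiguous burst, so the same instruction form covers both cases.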
A third aspect of the invention provides a method of operating a computer system as indicated above, comprising: identifying in source code computational elements suitable for compilation to, and execution with assistance of, the at least one data buffer; transforming the identified computational elements in the source code to a series of operations each involving a memory transaction no larger than the size of the at least one data buffer, and expressing such operations as burst instructions; and executing the source code by the processing system, wherein the identified computational elements are processed by the processing system through accesses to the at least one data buffer.
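The transformation step of the third aspect resembles splitting a long array loop into buffer-sized pieces. The following sketch shows one way this could look, assuming a simple element-wise computation; the function names and the scaling operation are illustrative and are not taken from the specification.

```python
def plan_bursts(base_address, element_count, buffer_size):
    """Split an array access into bursts no larger than the data buffer.

    Returns a list of (start_address, length) pairs.
    """
    bursts = []
    offset = 0
    while offset < element_count:
        length = min(buffer_size, element_count - offset)
        bursts.append((base_address + offset, length))
        offset += length
    return bursts


def scale_array(memory, base_address, element_count, buffer_size, factor):
    """Process an array through a buffer, one burst at a time."""
    for address, length in plan_bursts(base_address, element_count, buffer_size):
        buffer = memory[address:address + length]     # burst load
        buffer = [x * factor for x in buffer]         # compute on the buffer
        memory[address:address + length] = buffer     # burst store


memory = list(range(12))
scale_array(memory, base_address=2, element_count=7, buffer_size=3, factor=10)
```

Here the computation itself only ever touches the buffer, and the last burst is shorter when the array length is not a multiple of the buffer size.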
Other preferred features of the invention are defined in the appended claims.
The present invention is particularly, but not exclusively, applicable to certain classes of algorithm, which will be termed “media-intensive” algorithms. By this, it is meant an algorithm employing a regular program loop which accesses long arrays without any data dependent addressing. These algorithms exhibit high spatial locality and regularity, but low temporal locality. The high spatial locality and regularity arise because, if array item n is used, then it is highly likely that array item n+s will be used, where s is a constant stride between data elements in the array. The low temporal locality is due to the fact that an array item n is typically accessed only once.
Ordinary caches are predominantly desi
McCarthy Dominic Paul
Quick Stuart Victor
Hewlett-Packard Company
McLean Kimberly
Yoo Do Hyun
Profile ID: LFUS-PAI-O-2615676