Multiport memory architecture with direct data flow
Classification: Electrical computers and digital processing systems: memory – Storage accessing and control – Shared memory area
U.S. Classes: C710S120000, C710S052000
Type: Reexamination Certificate
Status: active
Patent Number: 06434674
Filed: 2000-04-04
Issued: 2002-08-13
Examiner: Peikari, B. James (Department: 2186)
ABSTRACT:
FIELD OF THE INVENTION
The present invention relates to computer memory architecture and, more particularly, to high speed, multi-ported, direct data flow memory architecture.
BACKGROUND OF THE INVENTION
Many computer I/O systems use a buffered data flow protocol. With such an architecture, data is moved into the system through one interface, buffered temporarily, and then moved out through another interface. The data path is often separate from the command/control path, but it need not be.
To increase overall system bandwidth, it is highly desirable to move the data over a given system bus or data pathway only once. Multi-ported memory architectures are used to buffer data, while satisfying this single pathway requirement. Data is input through one port and output through another.
The most common memory of this type is dual-ported, but higher port counts are also used. The memories may be implemented to work with component-level buses or with system-level buses such as PCI. The latter has become a preferred component interconnect because of the large number of personal computers that use the bus and the many peripheral devices that support it.
Multiple port memory architectures are generally implemented today in one of two ways. The first method uses memory composed of true multi-ported memory cells and is found in certain static random access memories (SRAMs). The second method uses a single-ported memory array (typically DRAM of some sort) with a multiplexing scheme that alternately permits one of several ports to access the memory one at a time. If the speed of the memory is significantly higher than that of the ports, and a speed-matching (synchronizing) mechanism is provided, then the memory may appear to be simultaneously accessed by multiple ports. This method is used by most PC bridge chip sets to multi-port a single bank of memory to a processor, a system I/O bus and a video bus.
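As a rough illustration of the second approach, the sketch below models a single-ported memory array shared by several ports through a simple round-robin arbiter. The structure and names (arbitrate_one_cycle, port_req_t, NUM_PORTS) are illustrative assumptions, not details taken from the patent.

/* Sketch: single-ported memory multiplexed among several ports. */
#include <stdint.h>

#define NUM_PORTS 4

typedef struct {
    uint32_t addr;
    uint32_t data;
    int      is_write;
    int      pending;           /* request waiting to be serviced */
} port_req_t;

static uint32_t   memory[1 << 20];   /* single-ported array */
static port_req_t ports[NUM_PORTS];

/* One arbitration step: grant the single memory array to the next
 * requesting port in round-robin order.  If the memory is fast enough
 * relative to the ports, each port appears to have its own access path. */
void arbitrate_one_cycle(void)
{
    static unsigned last = 0;

    for (unsigned i = 1; i <= NUM_PORTS; i++) {
        unsigned p = (last + i) % NUM_PORTS;
        if (ports[p].pending) {
            if (ports[p].is_write)
                memory[ports[p].addr] = ports[p].data;
            else
                ports[p].data = memory[ports[p].addr];
            ports[p].pending = 0;
            last = p;
            return;             /* only one port is served per cycle */
        }
    }
}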
A true dual-ported SRAM structure has certain advantages, among which are:
a) random accesses may occur on each port simultaneously without causing any access delay at the other port;
b) the initial access to a sequential block of memory takes the same time as do all subsequent accesses, so there is no initial access overhead time penalty; and
c) dual port SRAM designs are inherently simple, since multiplexing is not necessary.
Unfortunately, true dual-ported SRAM memory systems have certain disadvantages, also, among which are:
a) the memory is limited to only two ports;
b) the density of the memory is relatively low, typically requiring many chips per megabyte; and
c) the cost of the memory is very high, typically an order of magnitude more than the cost of DRAM.
The advantages and disadvantages of a multi-ported DRAM architecture, on the other hand, are opposite those of the SRAM. When an application requires more than 100 Kbytes of buffering, the cost of multi-port SRAM becomes prohibitive.
If more than two ports are required, a multiplexing scheme must be implemented, raising the overall cost of the system. In these cases, the most cost effective method of implementing a multi-ported RAM is by using a multiplexed DRAM architecture.
In typical operation, a device reads or writes data in blocks to and from the buffer memory. The size of these blocks varies according to the application. A memory controller usually performs memory accesses in fixed sizes, with the width and depth of the access set to a predetermined burst size. On a read from memory, when an external device uses more than one memory burst of data, the memory controller performs additional accesses as necessary. When the external device uses less than one burst of data, the additional data fetched from memory is discarded.
On a write to memory of one or more full bursts, the controller performs the write operation. If the write is only to a portion of the fixed size block in memory, the controller reads the data block from memory, modifies the correct portions, and then writes the block back. Additional memory bandwidth may be required if complex caching is not performed.
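A minimal sketch of that read-modify-write behavior follows, assuming a controller that always accesses memory in fixed bursts of BURST_WORDS 32-bit words; the burst size, the dram array, and the burst primitives are toy stand-ins, not the patent's interface, and the example only handles writes that fall within a single block.

#include <stdint.h>
#include <string.h>

#define BURST_WORDS 8                        /* assumed fixed burst size */

static uint32_t dram[1024 * BURST_WORDS];    /* toy stand-in for the memory array */

/* Toy burst primitives standing in for the real memory interface. */
static void mem_read_burst(uint32_t block, uint32_t buf[BURST_WORDS])
{
    memcpy(buf, &dram[block * BURST_WORDS], sizeof(uint32_t) * BURST_WORDS);
}

static void mem_write_burst(uint32_t block, const uint32_t buf[BURST_WORDS])
{
    memcpy(&dram[block * BURST_WORDS], buf, sizeof(uint32_t) * BURST_WORDS);
}

/* Write nwords words starting at word address addr, assuming the write
 * does not cross a block boundary.  A full, aligned burst is written
 * directly; anything smaller costs a read-modify-write of the block. */
void controller_write(uint32_t addr, const uint32_t *src, uint32_t nwords)
{
    uint32_t block  = addr / BURST_WORDS;
    uint32_t offset = addr % BURST_WORDS;
    uint32_t buf[BURST_WORDS];

    if (offset == 0 && nwords == BURST_WORDS) {
        mem_write_burst(block, src);         /* one access, no waste     */
    } else {
        mem_read_burst(block, buf);          /* fetch the whole block    */
        memcpy(&buf[offset], src, nwords * sizeof(uint32_t));
        mem_write_burst(block, buf);         /* write it back            */
    }
}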
DRAMs typically require several addressing and start-up cycles to begin an access to a consecutive block of data. This overhead is fixed, regardless of the amount of data read or written. As a result, the effective data bandwidth increases with larger transfer bursts, because the overhead is averaged over a greater amount of data actually moved. For a single memory port, the longer the burst at a given clock speed, the greater the available bandwidth of the port. When implementing a multi-port memory, however, the metric of interest is not the raw bandwidth of a single port. What matters instead is the net bandwidth across a pair of ports when one port writes data to memory and the other reads data out. In this situation, a long burst on one port delays access by the other port, adding overhead on that other port. Of course, if the data bursts are always very long, this overhead may be small compared with the data volume, and performance remains acceptable.
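For a single port, that amortization reduces to a simple formula: effective bandwidth is the burst size in bytes divided by the time for the overhead clocks plus the data clocks. The short sketch below evaluates it; the 6-clock overhead, 66.67 MHz clock and 32-bit width anticipate the numerical example that follows, and the function name is ours, not the patent's.

#include <stdio.h>

/* Effective single-port bandwidth in MB/sec when a fixed per-burst
 * overhead is amortized over the words actually transferred. */
static double effective_mb_per_sec(double clock_mhz, int overhead_clocks,
                                   int burst_words)
{
    double total_clocks = overhead_clocks + burst_words;  /* setup + data */
    double bytes        = burst_words * 4.0;               /* 32-bit words */
    return bytes * clock_mhz / total_clocks;
}

int main(void)
{
    /* Short bursts are dominated by overhead; long bursts approach the
     * raw 266.7 MB/sec that a 32-bit, 66.67 MHz memory can deliver. */
    printf("burst  1: %5.1f MB/sec\n", effective_mb_per_sec(66.667, 6, 1));
    printf("burst 64: %5.1f MB/sec\n", effective_mb_per_sec(66.667, 6, 64));
    return 0;
}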
If the application does not require large data bursts, or uses a mix of large and small bursts, a large memory access size makes highly inefficient use of the bandwidth and results in unacceptable performance. The inefficiency arises from discarding large amounts of unused data on reads, and from reading and then rewriting large amounts of unmodified data on writes. Reducing the memory access size helps; but without an additional mechanism for intelligently mapping the variable-sized device transfers onto the smaller fixed memory accesses performed by the controller, performance remains poor.
As a numerical example, consider a memory with a 6-clock-cycle access overhead (i.e., the time from initial request to first data is six clock cycles at the memory clock speed). If two ports request access simultaneously, one must wait for the other to complete. Assuming a 66 MHz clock, a 32-bit memory width, one clock cycle of port arbitration, and the first port winning access, Table I shows the data bandwidth of each port as a function of the memory burst size for one burst.
TABLE I

Burst Size        Port 1 clocks   Port 1 Peak   Port 2 clocks   Port 2 Peak   Dual Port
(32-bit words)    to move burst   MB/sec        to move burst   MB/sec        Throughput (MB/sec)
     1                   7            38.1            15            17.8            12.1
     2                   8            66.7            17            31.4            21.3
     4                  10           106.7            21            50.8            34.4
     8                  14           152.4            29            73.6            49.6
    16                  22           193.9            45            94.8            63.7
    32                  38           224.6            77           110.8            74.2
    64                  70           243.8           141           121.0            80.9
   128                 134           254.7           269           126.9            84.7
   256                 262           260.6           525           130.0            86.7
   512                 518           263.6          1037           131.7            87.8
  1024                1030           265.1          2061           132.5            88.3
  2048                2054           265.9          4109           132.9            88.6
  4096                4102           266.3          8205           133.1            88.8
  8192                8198           266.5         16397           133.2            88.8
 16384               16390           266.6         32781           133.3            88.9
As can be seen, the actual dual port throughput, which is the amount of data that can be moved into port 1 and out of port 2, is lower than the peak rates of either port. These rates do not include any overhead for the PCI busses to which they are connected. It can also be seen that the throughput levels off and does not approach the 132 MB/sec that a 32-bit PCI bus can sustain. Increasing burst size alone cannot deliver high multi-port throughput.
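For reference, the Table I figures can be reproduced with a few lines of arithmetic under the stated assumptions: a 66.67 MHz memory clock, 32-bit width, 6 clocks of access overhead, one arbitration clock charged to the waiting port, and the two ports served back to back. The sketch below is ours, not part of the patent.

#include <stdio.h>

int main(void)
{
    const double clk_mhz  = 66.667;   /* memory clock, MHz           */
    const int    overhead = 6;        /* clocks from request to data */

    printf("burst  p1_clk  p1_MB/s  p2_clk  p2_MB/s  dual_MB/s\n");
    for (int burst = 1; burst <= 16384; burst *= 2) {
        int    p1_clk = overhead + burst;              /* port 1 wins arbitration */
        int    p2_clk = 1 + p1_clk + overhead + burst; /* waits, then runs its own burst */
        double p1_bw  = burst * 4.0 * clk_mhz / p1_clk;
        double p2_bw  = burst * 4.0 * clk_mhz / p2_clk;
        /* One burst in through port 1 and out through port 2 moves one
         * burst of data in p1_clk + p2_clk clocks. */
        double dual   = burst * 4.0 * clk_mhz / (p1_clk + p2_clk);
        printf("%5d  %6d  %7.1f  %6d  %7.1f  %9.1f\n",
               burst, p1_clk, p1_bw, p2_clk, p2_bw, dual);
    }
    return 0;
}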
The throughput numbers shown above also assume that all of the data read from memory is used. If only a small fraction of a fixed-size memory burst is actually used, the larger fixed-size bursts waste memory bandwidth and the effective throughput drops further. If the data required and actually used were 512 bytes (128 32-bit words) and the memory burst size were set to 256 32-bit words, for example, the effective throughput would be half of that shown. Whenever the PCI transfers are smaller than the memory burst, bandwidth is wasted and effective throughput decreases.
Another problem arises when PCI delayed read transactions are performed. The PCI specification requires a target to issue a retry if the latency of a read access is significant (i.e.,
Inventors: DeWilde, Mark; Stone, Stephen
Assignee: Advanced Digital Information Corporation
Examiner: Peikari, B. James
Attorney: Salzman & Levy