Electrical – Theoretical calculation of DDR3L transfer speed

ddr, ddr3, memory

I am not sure if this question belongs on this Stack Exchange site, but I didn't find a better one. If it doesn't, let me know and I will move it somewhere more appropriate.

I am working with a QorIQ T2080 SoC and DDR3L memory (model MT41K256M16, DDR3-1866 mode, cycle time 1.071 ns @ CL = 13). The DDR3L configuration gives me a 1866 MT/s data rate on a 64-bit bus, so the theoretical peak data rate would be 14.9 GB/s (1866 MT/s × 64 bits / 8). However, in a DMA memory-copy test the transfer rate I obtain is about 4 GB/s (2 GB/s × 2, because the copy both reads and writes memory).
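Spelled out, the peak figure is just the transfer rate times the bus width in bytes:

$$ 1866 \times 10^{6}\ \tfrac{\text{transfers}}{\text{s}} \times 8\ \tfrac{\text{B}}{\text{transfer}} \approx 14.9\ \text{GB/s} $$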

I would like to understand where this value comes from.

From what I know (correct me if I am wrong), a DMA controller normally just sets up the data path when moving memory, so the data does not actually flow through it. Therefore, I did not consider it a limiting factor for the data rate.

I reached the conclusion that the memory could be the limiting factor and made the following supposition:

The datasheet specifies a burst length of 8 transfers of the bus width (64 bits), and a CAS latency of 13 cycles for each column access to the memory (cycle time 1.071 ns). As I said before, the maximum achievable peak data rate should be 14.9 GB/s. However, each time we access a new column of the RAM, data is only received during 8 transfers at 1866 MT/s (about 4.3 ns), and then we have to wait out the true latency [3]: CL × 1.071 ns ≈ 13.9 ns. Hence, I obtain a theoretical worst case:

$$ \frac{64\ \text{B per burst}}{\frac{8\ \text{transfers}}{1.866\ \text{GT/s}} + 13.9\ \text{ns}} \approx 3.5\ \text{GB/s} $$
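A few lines of Python (using only the numbers already quoted above, nothing taken from the T2080 or Micron documentation) reproduce both figures:

```python
# Reproduce the peak and worst-case estimates from the numbers above.
tck_ns = 1.071        # DDR3-1866 clock period
cl_cycles = 13        # CAS latency
burst_len = 8         # transfers per column access (BL8)
bus_bytes = 8         # 64-bit bus -> 8 bytes per transfer
rate_gtps = 1.866     # 1866 MT/s data rate

peak_gbps = rate_gtps * bus_bytes                    # ~14.9 GB/s
burst_bytes = burst_len * bus_bytes                  # 64 B per burst
burst_time_ns = burst_len / rate_gtps                # ~4.3 ns on the data bus
cas_ns = cl_cycles * tck_ns                          # ~13.9 ns of waiting
worst_case_gbps = burst_bytes / (burst_time_ns + cas_ns)  # bytes/ns == GB/s

print(f"peak: {peak_gbps:.1f} GB/s, worst case: {worst_case_gbps:.1f} GB/s")
# -> peak: 14.9 GB/s, worst case: 3.5 GB/s
```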

In [1] they mention that DDR controllers are commonly able to obtain a higher throughput than this, getting close to the peak data rate.

My question is: Is this supposition right?

Other sources where I looked:

Best Answer

Your calculation of the peak throughput is correct, given the input.

The actual throughput largely depends on the design of the DMA and DRAM controllers. The 3.5 GB/s figure is a good approximation, but it relies on the assumption that the DMA controller issues one 64-byte transfer at a time and that each transfer hits an already open (active) DRAM page.

In DRAM, reads and writes may partially overlap, so that the column address of the next operation is sent to the DRAM chip while it is still transferring data for the previous access. This is how the actual throughput can get close to the theoretical peak. For this to happen there must, of course, be enough buffer space to hold the burst data. In memory-to-CPU transfers, the CPU cache can usually consume or produce a relatively large chunk of bytes quickly when prefetching from memory. I am not sure about DMA controllers, though. Your DMA controller may have FIFO buffers, but I don't know whether they are big enough to sustain a transfer longer than a single burst. A lack of sufficient buffering may be the reason you are not getting close to the theoretical throughput.
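As a toy model (my own illustration, not taken from the T2080 or Micron documentation), assume the controller can keep N 64-byte requests in flight, so the CAS latency of one access overlaps the data phase of the others:

```python
# Toy model: throughput vs. number of outstanding 64-byte requests.
burst_bytes = 64
burst_time_ns = 4.3   # 8 transfers at 1866 MT/s
cas_ns = 13.9         # CL * tCK from the question

def throughput_gbps(in_flight: int) -> float:
    # The data bus is busy burst_time_ns per request; with few requests in
    # flight, the uncovered part of the CAS latency leaves the bus idle.
    period_ns = max(in_flight * burst_time_ns, burst_time_ns + cas_ns)
    return in_flight * burst_bytes / period_ns

for n in (1, 2, 4, 8):
    print(n, round(throughput_gbps(n), 1), "GB/s")
# 1 -> 3.5 GB/s (the worst case above), 8 -> 14.9 GB/s (essentially the peak)
```

Once enough requests are in flight to cover the latency (about five in this model), the data bus itself becomes the limit.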

Another aspect is related to DRAM pages. In order to issue reads and writes, the corresponding page has to be open ("activated"). Typically, DRAM chips consist of several banks, each with its own page buffer. If the source and destination of your transfers happen to reside in different banks, there may be no additional cost for opening a page. Otherwise, the controller may have to close and open pages on every access, which would slow the transfers down significantly (~30-50 ns each time). The controller may also choose to close the DRAM page after each access, implementing the so-called "closed page" policy, if configured to do so.
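As a rough illustration (taking ~40 ns from the middle of that range and adding it to the burst and CAS times used above), paying the page close/open cost on every 64-byte access would give:

$$ \frac{64\ \text{B}}{4.3\ \text{ns} + 13.9\ \text{ns} + 40\ \text{ns}} \approx 1.1\ \text{GB/s} $$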

There will always be page open/close overhead when crossing pages, but it should be amortized in large sequential reads and writes.
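For example, if a page spans 8 KB across the 64-bit rank (a hypothetical figure for illustration, not taken from the MT41K256M16 datasheet), one ~40 ns activate per page adds only a few percent on top of the ~550 ns it takes to stream those 8 KB at the peak rate:

$$ \frac{8192\ \text{B}}{550\ \text{ns} + 40\ \text{ns}} \approx 13.9\ \text{GB/s} $$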

Unfortunately, to get a more definitive answer, you have to dig through the DRAM and SoC/DMA controller datasheets, if they are available.

ADDED: I forgot to ask: why are you concerned about DMA memory-to-memory throughput? Since DMA is mostly intended for device access, it may not be designed for high speed, and you may have better luck copying memory with the CPU.