A DDR memory device actually consists of two distinct components:
1: A series of memory arrays composed mostly of capacitors, which are written to and read from using a very wide bank of differential amplifiers. This is fundamentally an analogue circuit, surprisingly enough.
2: An interface buffer, which allows the hundreds or thousands of individual bits produced by a single memory-array read cycle to be interfaced to a reasonable number of data lines to the Northbridge or CPU. Several cycles on the external interface are needed to transmit the data in the buffer.
In general, the feature size of semiconductor technology decreases over time as manufacturing technology is refined. This has different effects in the above two components.
For the memory array, the differential amplifiers become more sensitive and the individual capacitors become smaller. This allows a larger array to be constructed in the same die area, reading out more bits per cycle. The speed of the array remains roughly the same, however.
For the interface buffer, some of the data paths become shorter and therefore faster, required voltage swings reduce, and there is now space for better skew-correction, clock recovery, etc. This permits higher external signalling speeds within a reasonable power and area budget. The original DDR RAM simply transmitted data on both the rising and falling edges of the clock signal, instead of only on the rising edge as SDRAM did. More recent versions effectively multiply the basic clock signal as well.
This "basic clock signal" usually works out to around 200MHz in mainstream products of each generation, though faster and slower devices are also available. In original DDR, a 200MHz clock meant 400 MT/s, and was often described as 400MHz (or DDR-400) though the highest frequency signal is actually 200MHz. In DDR2, the basic clock is doubled using a PLL at both ends of the interface, so the actual clock rate is 400MHz and there are 800 MT/s. In DDR3 the clock is quadrupled and in DDR4 it is octupled, giving typically 3200 MT/s today. As you can imagine, the timing relative to the clock edges has to be controlled very carefully.
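As a sanity check on those numbers, here is a quick sketch of how the ~200MHz base clock maps to transfer rates in each generation, using the multipliers described above:

```python
# How the ~200 MHz base clock maps to transfer rates per DDR
# generation (multipliers as described above; data toggles on
# both clock edges in every generation).
base_clock_mhz = 200

# (generation, clock multiplier applied via PLLs, data edges per clock)
generations = [
    ("DDR",  1, 2),  # data on both edges of the base clock -> 400 MT/s
    ("DDR2", 2, 2),  # clock doubled  -> 400 MHz I/O clock, 800 MT/s
    ("DDR3", 4, 2),  # clock quadrupled
    ("DDR4", 8, 2),  # clock octupled -> 3200 MT/s
]

for name, mult, edges in generations:
    io_clock = base_clock_mhz * mult
    mts = io_clock * edges
    print(f"{name}: I/O clock {io_clock} MHz -> {mts} MT/s")
```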
Since the memory arrays themselves haven't changed much in speed, these higher interface speeds come with increased CAS (column address strobe) latency, or "CL", figures. CL describes how many interface clock cycles elapse between issuing the column address and receiving the first data, and these growing cycle counts accommodate the roughly constant speed of the memory arrays relative to the ever-faster interface bus.
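To see why rising CL figures don't mean the memory is getting slower in absolute terms, here is a rough conversion of CL into nanoseconds. The example CL values and clock rates below are illustrative, not taken from any particular datasheet:

```python
# Convert a CL figure into absolute time, assuming CL is counted in
# interface clock cycles (the usual convention).
def cas_latency_ns(cl_cycles, io_clock_mhz):
    return cl_cycles * 1000.0 / io_clock_mhz

# Illustrative: DDR-400 (200 MHz clock, CL3) vs DDR4-3200
# (1600 MHz clock, CL22) -- far more cycles, similar wall-clock time.
print(cas_latency_ns(3, 200))    # 15.0 ns
print(cas_latency_ns(22, 1600))  # 13.75 ns
```

The absolute latency stays in the same ballpark even though the cycle count grows roughly sevenfold, which is exactly the "arrays haven't changed much in speed" point above.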
One of the things that the basic clock controls more-or-less directly, rather than through a PLL, is the self-refresh cycle of the memory arrays. Using capacitors to store bits is very space-efficient, but the charge leaks out of them rather easily, degrading the stored value within a few tens of milliseconds, so the memory arrays must constantly cycle through their contents, reading and re-writing each row to ensure the data remains valid.
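As a back-of-envelope illustration of the refresh workload: assuming a typical 64ms retention window and 8192 rows to cover per refresh cycle (both common figures, but assumptions here rather than anything stated above), a refresh has to be issued every few microseconds:

```python
# Back-of-envelope refresh cadence, assuming a 64 ms retention
# window and 8192 rows to refresh (typical but assumed figures).
retention_ms = 64
rows = 8192

interval_us = retention_ms * 1000 / rows
print(f"one refresh roughly every {interval_us:.2f} us")  # ~7.81 us
```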
The difference is that S2 has the speed (in terms of bandwidth/IO throughput) of LPDDR1, while S4 has the speed you'd expect LPDDR2 SDRAM to have.
This might seem strange, but let me explain through a brief bit of standards history:
LPDDR1 is essentially just DDR1 but at 1.8V instead of 2.5V. It is otherwise the same as what desktops used to use.
LPDDR2 is not merely an incremental performance increase like you're probably used to on desktops, but a much more radical departure from DDR1 and DDR2 RAM in favor of mobile and low-power features. It is not at all compatible with DDR1 or DDR2, and honestly ought to have a less similar name to convey this, in my opinion anyway.
One of the features of LPDDR2 is support for different speeds (hence the 'S' before the number) of RAM. LPDDR2-S2 is LPDDR1 memory (in terms of IO throughput) with the extra power-saving features and other niceties of LPDDR2. So yes, S2 is going to give you LPDDR1 speeds only, despite being called LPDDR2. This is a good thing, however: in applications with modest throughput requirements, you get the lower power and cost of LPDDR1-class memory, plus all the improvements and further power-consumption optimizations of LPDDR2. If there were no S2 RAM, the only option would be to use LPDDR1. Also, because of its smaller prefetch buffer size (which I will explain further down), S2 has lower latency at a fixed clock compared to S4.
S4 RAM has twice the throughput of S2 RAM, giving the speed improvement you'd expect moving from LPDDR1 to LPDDR2, as well as the increased latency that comes with the larger prefetch.
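That throughput difference can be sketched numerically. The 200MHz array clock and 32-bit bus width below are assumptions chosen for illustration, not values from any specific part:

```python
# Peak bandwidth sketch for LPDDR2-S2 vs -S4 at the same internal
# array clock. The 200 MHz clock and 32-bit bus are assumptions.
def peak_mb_per_s(array_clock_mhz, prefetch_n, bus_bits=32):
    transfers_per_us = array_clock_mhz * prefetch_n  # = MT/s
    return transfers_per_us * bus_bits // 8          # MB/s

print("S2:", peak_mb_per_s(200, 2), "MB/s")  # 2n prefetch -> 1600 MB/s
print("S4:", peak_mb_per_s(200, 4), "MB/s")  # 4n prefetch -> 3200 MB/s
```

Same array, same clock; the doubled prefetch is the whole difference.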
Summary:
Unlike DDR1 vs DDR2 memory, which was mostly just a performance improvement, LPDDR1 vs LPDDR2 improved performance as well as adding significant power-saving features. Slower RAM uses less power, but slower RAM built to a more recent standard with further power-consumption improvements uses the least. That is what S2 is - the slower LPDDR1-speed memory, but with LPDDR2 power-saving improvements.
OK, let's get technical
I superficially answered your question, but here is a much more technical answer as well!
First, I'm going to discuss what DDR memory even is.
SDRAM is arranged in rows of words (which are usually the same length as the width of the bus, so if the bus is 32 bits wide, a word is 32 bits long). A row holds many words; 2048 would be a typical number today. Each of these 2048 words has a location in the row, which is called a column. Nothing exciting yet - this should be familiar and straightforward so far.
There is an asymmetry in the time it takes to access things in SDRAM, however. Moving to a particular row takes the longest of any operation. Once you are at that row, reading a word from a column in that row is trivially quick. The costly thing is changing rows, but you can read a lot of stuff very quickly if it's all in the same row.
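A toy address split shows why adjacent words are cheap to read. Assuming a 32-bit word and 2048 columns per row as above, consecutive addresses land in the same row:

```python
# Illustrative byte-address -> (row, column) split for the layout
# described above: 32-bit (4-byte) words, 2048 columns per row.
WORD_BYTES = 4
COLS_PER_ROW = 2048  # so one row spans 8192 bytes

def row_and_column(byte_addr):
    word = byte_addr // WORD_BYTES
    return word // COLS_PER_ROW, word % COLS_PER_ROW  # (row, column)

# Two adjacent words share a row, so the second access skips the
# expensive row change:
print(row_and_column(0x2000))  # (1, 0)
print(row_and_column(0x2004))  # (1, 1)
```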
Enter prefetching. Most of the time (not always, but definitely the majority), a CPU is going to want words directly adjacent to a word it requests, as well. With SDRAM, the CPU had to ask for each one, and that meant one clock cycle per word.
DDR RAM prefetches one additional word: the word directly after the one the CPU requests. Both are loaded into a prefetch buffer and 'burst' out of the IO pins on the rising and falling clock edges. Now, the CPU only asked for the first word, and it may not need or care about the second word, so there is a big assumption here. However, it is so often the case that the CPU does in fact want that second word as well that DDR memory has indeed yielded large increases in memory performance for computers.
Anyway, this is why the MT/s is double the clock rate. This is also what the S2 suffix refers to: it uses a 2n prefetch, meaning it prefetches 2 words per column request.
DDR2 memory doubles the IO clock relative to the internal array clock, yielding 4 clock edges per internal array cycle, and uses these to implement an otherwise identical prefetch scheme to DDR1's, only it prefetches 4 words (one for each edge on which they are clocked out to the bus) instead of 2. This is what 4n prefetch refers to, and likewise what the 'S4' suffix in LPDDR2-S4 RAM is referring to.
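A toy model of the prefetch buffer may help: one column access pulls n adjacent (aligned) words out of the open row, and they are then burst out one per clock edge. The memory contents here are made up purely for illustration:

```python
# Toy prefetch-buffer model: a column access fetches n aligned,
# adjacent words from the open row in one go; they are then burst
# out one per clock edge. Row contents are made-up sample data.
def burst_read(row_words, column, prefetch_n):
    base = (column // prefetch_n) * prefetch_n  # align to the buffer
    return [row_words[base + i] for i in range(prefetch_n)]

row = list(range(100, 108))   # one (tiny) open row of 8 words

print(burst_read(row, 2, 2))  # DDR1/S2-style 2n prefetch -> [102, 103]
print(burst_read(row, 1, 4))  # DDR2/S4-style 4n prefetch -> [100, 101, 102, 103]
```

Note how the 4n read returns the requested word plus three neighbours whether or not the CPU wants them, which is where S4's extra latency (and throughput) comes from.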
So, at the fundamental technical level, S2 LPDDR2 memory only achieves a data rate of twice the IO clock, just as if it were LPDDR1 memory. In terms of bus IO it is LPDDR1 memory, but interfaced like the faster LPDDR2-S4 memory. Yes, the throughput is not as good, but you get lower power, cost, and latency in return. Like most things, it's a trade-off. Let your application dictate which one you choose, as they both have their uses.
Your calculation of the peak throughput is correct, given the input.
The actual throughput largely depends on the design of the DMA and DRAM controllers. The 3.4 GB/s is a good approximation, but it relies on the assumption that the DMA controller does one 64-byte transfer at a time and each transfer hits already opened (active) DRAM bank.
In DRAM, reads and writes may be partially overlapped, so that the column address of the next operation is sent to the DRAM chip while it is still transferring data for the previous access. This is how actual throughput may get close to the theoretical peak. For this to happen, there must obviously be enough buffer space to store the burst data. In memory-to-CPU transfers, the CPU cache can usually consume or produce a relatively large chunk of bytes quickly when prefetching from memory. I am not sure about DMA controllers, though. Your DMA controller may have FIFO buffers, but I don't know if they are big enough to sustain a transfer longer than a single burst. A lack of sufficient buffering may be the reason you are not getting close to the theoretical throughput.
Another aspect is related to DRAM pages. In order to issue reads and writes, the corresponding page has to be open ("activated"). Typically, DRAM chips consist of several banks, each having its own page buffer. If the source and destination of your transfers happen to reside in different banks, there may be no additional cost for opening a page. Otherwise, the controller may have to close and open pages on every access, which would slow down the transfers significantly (~30-50 ns each time). The controller may also choose to close the DRAM page after each access, implementing the so-called "closed page" policy, if configured to do so.
There will always be page open/close overhead when crossing pages, but it should be amortized in large sequential reads and writes.
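A back-of-envelope model shows how the page hit ratio eats into a peak figure like your 3.4 GB/s. The 40ns penalty is an illustrative value from the range above, and the model ignores overlap between commands and data:

```python
# Rough model of effective throughput vs. page hit ratio, assuming
# 64-byte bursts, a 3.4 GB/s peak, and a flat ~40 ns close+open
# penalty per miss (illustrative numbers; no command overlap modeled).
def effective_gb_per_s(peak_gb_per_s, burst_bytes, hit_ratio,
                       miss_penalty_ns):
    burst_ns = burst_bytes / peak_gb_per_s          # GB/s == bytes/ns
    avg_ns = burst_ns + (1 - hit_ratio) * miss_penalty_ns
    return burst_bytes / avg_ns

print(effective_gb_per_s(3.4, 64, 1.0, 40))  # all page hits  -> 3.4
print(effective_gb_per_s(3.4, 64, 0.0, 40))  # all page misses -> ~1.1
```

Even in the worst case the overhead only costs a factor of ~3 here, and large sequential transfers sit near the all-hits end, which is the amortization point above.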
Unfortunately, to get a more definitive answer, you have to dig through the DRAM and SoC/DMA controller datasheets, if they are available.
ADDED: I forgot to ask: why are you concerned about DMA memory-to-memory throughput? Since DMA is mostly intended for device access, it may not be designed for high speed, and you may have better luck copying memory with the CPU.