One thing that used to be common in video graphics controllers is Video RAM, or VRAM.
VRAM has two sets of data output pins, and thus two ports that can be used simultaneously. The first port, the DRAM port, is accessed by the host computer in a manner very similar to traditional DRAM. The second port, the video port, is typically read-only and is dedicated to providing a high throughput, serialized data channel for the graphics chipset.
Internally, VRAM reads an entire DRAM row and shifts it out sequentially to the video circuitry. This leaves the DRAM port available for use by the MPU. VRAM has largely been replaced by ordinary SDRAM, even though SDRAM is only single-ported and requires more overhead.
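For intuition, here is a minimal C model of that dual-port arrangement. The array sizes and names (`vram_row_transfer`, the shift register standing in for the serial access memory) are illustrative only, not taken from any particular VRAM datasheet:

```c
/* Conceptual sketch of a dual-ported VRAM (hypothetical sizes and names).
 * One row-transfer cycle fills the serial shift register; the video side
 * then clocks pixels out while the DRAM port stays free for the MPU. */
#include <stdint.h>
#include <string.h>

#define ROW_LEN 256                     /* pixels per DRAM row (example) */

typedef struct {
    uint8_t dram[512][ROW_LEN];         /* DRAM array: 512 rows (example) */
    uint8_t shift_reg[ROW_LEN];         /* serial access memory (SAM)     */
    int     shift_pos;                  /* next pixel to clock out        */
} vram_t;

/* "Row transfer" cycle: copy one whole row into the shift register. */
void vram_row_transfer(vram_t *v, int row)
{
    memcpy(v->shift_reg, v->dram[row], ROW_LEN);
    v->shift_pos = 0;
}

/* Video port: one pixel per dot clock, no DRAM-port contention. */
uint8_t vram_video_clock(vram_t *v)
{
    uint8_t pixel = v->shift_reg[v->shift_pos];
    v->shift_pos = (v->shift_pos + 1) % ROW_LEN;
    return pixel;
}

/* DRAM port: ordinary MPU read/write, independent of the video port. */
void vram_mpu_write(vram_t *v, int row, int col, uint8_t value)
{
    v->dram[row][col] = value;
}
```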
A technique I have used in the past is to use interleaved access to memory. It's a bit complex to explain (the devil is in the details), but I will outline the basics:
Basically the MPU accesses video memory in between pixel accesses by the video-controller. If this timing gets too tight, there are a couple of things you can do that will greatly relieve it (usually only one of these is necessary):
- You can use 2 RAM chips (or banks) and interleave them, using each chip for every other pixel (see the sketch after this list). In your case, this would effectively slow your pixel clock to 80 ns per chip, allowing MPU and video-controller accesses to have windows of 40 ns each. This could be extended to more banks interleaving more pixels if necessary. This technique is called Interleaved Memory.
- You can increase the data-bus size of the video memory. The video-controller would read multiple pixels in a single access and use them sequentially. On the MPU side, either the MPU would also have a larger data bus, or each access would be directed to the appropriate byte (or word) using byte-selects on the video memory, or a read-modify-write would have to be performed to write into the larger data size. In your case, it would probably be simplest to increase the video memory data bus to 16 or 32 bits (2 or 4 pixels), and probably then use an MPU with the same bus size.
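For the first option, a minimal sketch of two-way interleaving might look like the following; the frame-buffer size and helper names are made up for illustration, and the 40 ns/80 ns figures refer to the example above:

```c
/* Minimal sketch of two-way pixel interleaving. Even pixels live in
 * bank 0, odd pixels in bank 1, so each individual RAM chip only has
 * to deliver a pixel every 80 ns, leaving the other half of each
 * 40 ns pixel period free for MPU access. */
#include <stdint.h>

#define FB_PIXELS  (320 * 240)          /* example frame-buffer size */

static uint8_t bank0[FB_PIXELS / 2];    /* even pixels */
static uint8_t bank1[FB_PIXELS / 2];    /* odd pixels  */

/* Map a linear pixel index to (bank, offset). */
static inline uint8_t *pixel_slot(uint32_t pixel)
{
    uint8_t *bank = (pixel & 1u) ? bank1 : bank0;
    return &bank[pixel >> 1];
}

/* MPU write: lands in whichever bank is idle during that half-cycle. */
void fb_write(uint32_t pixel, uint8_t value) { *pixel_slot(pixel) = value; }

/* Video-controller read: alternates banks on every pixel clock. */
uint8_t fb_read(uint32_t pixel)              { return *pixel_slot(pixel); }
```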
If you interleave video accesses, you may want to consider the use of an FPGA or CPLD for your video memory controller.
Another method is to have two separate video memories and use bank-select. The MPU writes to one bank while the other is being used by the video-controller for display. When the MPU is finished writing, the bank accesses are swapped (usually during a sync pulse).
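A sketch of that swap logic, with `wait_for_vsync()` and `set_scanout_bank()` standing in for whatever your hardware actually provides:

```c
/* Sketch of the bank-select (double-buffer) approach: the MPU draws into
 * one bank while the video-controller scans the other, and the two are
 * swapped during vertical sync. The extern functions are placeholders. */
#include <stdint.h>

#define FB_SIZE (320 * 240)

static uint8_t bank_a[FB_SIZE], bank_b[FB_SIZE];

static uint8_t *draw_buf = bank_a;        /* MPU writes here              */
static uint8_t *scan_buf = bank_b;        /* video-controller reads here  */

extern void wait_for_vsync(void);         /* placeholder: sync-pulse wait  */
extern void set_scanout_bank(uint8_t *);  /* placeholder: bank-select line */

void swap_banks(void)
{
    wait_for_vsync();                     /* swap only while not displaying */
    uint8_t *tmp = draw_buf;
    draw_buf = scan_buf;
    scan_buf = tmp;
    set_scanout_bank(scan_buf);
}
```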
A DDR memory device actually consists of two distinct components:
1: A series of memory arrays composed mostly of capacitors, which are written to and read from using a very wide bank of differential amplifiers. This is fundamentally an analogue circuit, surprisingly enough.
2: An interface buffer, which allows the hundreds or thousands of individual bits produced by a single memory-array read cycle to be interfaced to a reasonable number of data lines to the Northbridge or CPU. Several cycles on the external interface are needed to transmit the data in the buffer.
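As a rough illustration of the buffer's job (the widths below are examples, not figures for any specific part):

```c
/* One internal array read produces far more bits than the external bus
 * can carry at once, so the interface buffer streams them out over
 * several bus transfers. Example numbers only. */
#include <stdio.h>

int main(void)
{
    int array_read_bits = 8 * 64;   /* e.g. an 8n prefetch on a x64 module */
    int bus_width_bits  = 64;       /* external data-bus width             */

    printf("external transfers per array read: %d\n",
           array_read_bits / bus_width_bits);   /* -> 8 */
    return 0;
}
```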
In general, the feature size of semiconductor technology decreases over time as manufacturing technology is refined. This has different effects in the above two components.
For the memory array, the differential amplifiers become more sensitive and the individual capacitors become smaller. This allows a larger array to be constructed in the same die area, reading out more bits per cycle. The speed of the array remains roughly the same, however.
For the interface buffer, some of the data paths become shorter and therefore faster, required voltage swings reduce, and there is now space for better skew-correction, clock recovery, etc. This permits higher external signalling speeds within a reasonable power and area budget. The original DDR RAM simply transmitted data on both the rising and falling edges of the clock signal, instead of only on the rising edge as SDRAM did. More recent versions effectively multiply the basic clock signal as well.
This "basic clock signal" usually works out to around 200MHz in mainstream products of each generation, though faster and slower devices are also available. In original DDR, a 200MHz clock meant 400 MT/s, and was often described as 400MHz (or DDR-400) though the highest frequency signal is actually 200MHz. In DDR2, the basic clock is doubled using a PLL at both ends of the interface, so the actual clock rate is 400MHz and there are 800 MT/s. In DDR3 the clock is quadrupled and in DDR4 it is octupled, giving typically 3200 MT/s today. As you can imagine, the timing relative to the clock edges has to be controlled very carefully.
Since the memory arrays themselves haven't changed much in speed, these higher interface speeds come with increased CAS (column address strobe) latency, or CL, figures. These describe how many interface clock cycles elapse between providing the column address and receiving the data, and they accommodate the limited speed of the memory arrays relative to the interface bus.
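A quick worked example of why the growing CL numbers do not mean the memory has become slower in absolute terms (the parts and timings are illustrative, not from specific datasheets):

```c
/* The absolute column-access delay stays in the same ballpark because
 * the clock the CL is counted against keeps getting faster. */
#include <stdio.h>

int main(void)
{
    /* { name, interface clock in MHz, CAS latency in clock cycles } */
    const struct { const char *name; double clk_mhz; int cl; } parts[] = {
        { "DDR-400    CL3 ",  200.0,  3 },
        { "DDR4-3200  CL22", 1600.0, 22 },
    };

    for (unsigned i = 0; i < sizeof parts / sizeof parts[0]; i++) {
        double ns = parts[i].cl * 1000.0 / parts[i].clk_mhz;
        printf("%s  ->  %.1f ns from column address to data\n",
               parts[i].name, ns);            /* 15.0 ns vs 13.8 ns */
    }
    return 0;
}
```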
One of the things that the basic clock controls more-or-less directly, rather than through a PLL, is the self-refresh cycle of the memory arrays. Using capacitors to store bits is very space-efficient, but the charge leaks out of them rather easily and the stored values would fade within a few tens of milliseconds, so the memory arrays must constantly cycle through their contents, reading and re-writing them to ensure they remain valid.
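With typical JEDEC-style figures (a 64 ms retention window spread over 8192 refresh commands, which are common values rather than a universal requirement), that housekeeping works out to roughly one refresh command every 7.8 µs:

```c
/* The refresh bookkeeping the basic clock drives, with typical figures. */
#include <stdio.h>

int main(void)
{
    double retention_ms = 64.0;   /* every row must be refreshed within this */
    int    refresh_cmds = 8192;   /* refresh commands spread over the window */

    printf("one refresh command every %.2f us\n",
           retention_ms * 1000.0 / refresh_cmds);   /* ~7.81 us */
    return 0;
}
```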
Best Answer
The difference is that S2 is the speed (in terms of bandwidth/IO throughput) of LPDDR1, and S4 is the speed you'd expect LPDDR2 SDRAM to be.
This might seem strange, but let me explain through a brief bit of standards history:
LPDDR1 is essentially just DDR1 but at 1.8V instead of 2.5V. It is otherwise the same as what desktops used to use.
LPDDR2 is not merely an incremental performance increase like you're probably used to on desktops, but a much more radical departure from DDR1 and DDR2 RAM in favor of mobile and low-power features. It is not at all compatible with DDR1 or DDR2, and honestly ought to have a less similar name to convey this, in my opinion anyway.
One of the features of LPDDR2 is support for different speeds of RAM (hence the 'S' before the number). LPDDR2-S2 is LPDDR1 memory (in terms of IO throughput) with the extra power-saving features and other niceties of LPDDR2. So yes, S2 is going to give you LPDDR1 speeds only, despite being called LPDDR2. This is a good thing, however: in applications with modest throughput requirements, you can take advantage of the lower power and cost of LPDDR1 while also getting all the improvements and further power-consumption optimizations of LPDDR2. If there were no S2 RAM, the only option would be to use LPDDR1. Also, because of the smaller prefetch buffer size (which I will explain further down), S2 has lower latency at a fixed clock compared to S4.
S4 RAM has twice the throughput of S2 RAM, and has the speed improvement you'd expect to see moving from LPDDR1 to LPDDR2, as well as the increased latency.
Summary: Unlike DDR1 vs DDR2 memory, which was mostly just a performance improvement, LPDDR1 vs LPDDR2 improved performance as well as adding significant power-saving features. Slower RAM uses less power, but slower RAM using a more recent standard with further improvements to power consumption uses the least. That is what S2 is - the slower LPDDR1 memory but with LPDDR2 power-saving improvements.
OK, let's get technical
I superficially answered your question, but here is a much more technical answer as well!
First, I'm going to discuss what DDR memory even is.
SDRAM is arranged in rows of words (which are usually the same length as the width of the bus, so if the bus is 32 bits wide, a word is 32 bits long). A row contains many words; 2048 would be a typical number today. Each of these 2048 words has a location in the row, which is called a column. Nothing exciting yet - this should be familiar and straightforward so far.
There is an asymmetry in the time it takes to access things in SDRAM, however. Moving to a particular row takes the longest of any operation. Once you are at that row, reading a word from a column in that row is trivially quick. The costly thing is changing rows, but you can read a lot of data very quickly if it's all in the same row.
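Here is that row/column split in code, using the 2048-words-per-row example from above (so 11 column bits); the addresses are arbitrary and just chosen to show both cases:

```c
/* Sketch of the row/column address split. Whether a new access is
 * "cheap" depends only on whether the row bits changed. */
#include <stdint.h>
#include <stdio.h>

#define COL_BITS 11u                       /* 2^11 = 2048 words per row */
#define COL_MASK ((1u << COL_BITS) - 1u)

static uint32_t row_of(uint32_t word_addr) { return word_addr >> COL_BITS; }
static uint32_t col_of(uint32_t word_addr) { return word_addr & COL_MASK; }

int main(void)
{
    uint32_t a = 0x12345, b = 0x12346, c = 0x23456;

    /* a and b share a row: only the column changes, so back-to-back
     * reads are fast. c sits in a different row: the expensive case. */
    printf("a: row %u col %u\n", row_of(a), col_of(a));
    printf("b: row %u col %u (same row as a: %s)\n",
           row_of(b), col_of(b), row_of(a) == row_of(b) ? "yes" : "no");
    printf("c: row %u col %u (same row as a: %s)\n",
           row_of(c), col_of(c), row_of(a) == row_of(c) ? "yes" : "no");
    return 0;
}
```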
Enter prefetching. Most of the time (not always, but definitely the majority), a CPU is going to want words directly adjacent to a word it requests, as well. With SDRAM, the CPU had to ask for each one, and that meant one clock cycle per word.
DDR RAM prefetches an additional word: the one directly after the word the CPU requests. Both are loaded into a prefetch buffer and 'burst' out of the IO pins on the rising and falling clock edges. Now, the CPU only asked for the first word, and it may not need or care about the second word, so there is a big assumption here. However, it is so often the case that the CPU does in fact want that second word as well that DDR memory has indeed yielded large increases in memory performance for computers.
Anyway, this is why the MT/s is double that of the clock. This is what the S2 is referring to - it is using a 2n prefetch, meaning it prefetches 2 words per column request.
DDR2 memory internally doubles the IO clock, yielding 4 edge changes per external bus clock cycle, and uses this to implement an otherwise identical prefetch system as DDR1, only it prefetches 4 words (since it has 4 edge changes to clock them out to the bus on) instead of 2. This is what 4n prefetch refers to, and likewise what the 'S4' suffix in LPDDR2-S4 ram is referring to.
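To put numbers on the 2n vs 4n difference (the 200 MHz array clock here is just an illustrative figure, not a value from any particular datasheet):

```c
/* With the same internal column-access rate, IO throughput scales with
 * the prefetch depth, because that many words are burst out per column
 * access. */
#include <stdio.h>

int main(void)
{
    double array_clock_mhz = 200.0;   /* internal column-access rate (example) */

    int prefetch_s2 = 2;              /* LPDDR2-S2: 2n prefetch, LPDDR1-like   */
    int prefetch_s4 = 4;              /* LPDDR2-S4: 4n prefetch                */

    printf("S2: %.0f MT/s\n", array_clock_mhz * prefetch_s2);   /* 400 MT/s */
    printf("S4: %.0f MT/s\n", array_clock_mhz * prefetch_s4);   /* 800 MT/s */
    return 0;
}
```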
So, at the fundamental technical level, S2 LPDDR2 memory is only going to achieve double the IO clock, just as if it were LPDDR1 memory. In terms of the bus IO, it is LPDDR1 memory, but interfaced like faster LPDDR2-S4 memory. Yeah, the throughput is not so good, but you get lower power, cost, and latency in return. Like most things, it's a trade off. Let your application dictate the one you choose as they both have their uses.