You can take a look at a DDR3 die photo and an X-ray of the same chip here: http://chipworksrealchips.blogspot.com/2011/02/how-to-get-5-gbps-out-of-samsung.html
You can see that the memory is organized along a central spine and that the pads are placed along this spine. I can't tell you more about the internal layout, as it's not my field of expertise.
For DDR PCB layout you can read this Application Note :
For the chips' placement it's mostly a signal-integrity issue, as the timings are sensitive.
If your PCB and process technology allow any placement, and your design complies with the DDR/DDR2/DDR3 standards (mostly timing constraints), you are free to go with it.
I haven't worked with a DDR3 board yet; I have only worked with a board using DDR2 chips. Its five chips were placed side by side on the same side of the board (which can be either the same side as the CPU or the opposite one).
I can only recommend simulating your DDR design to make sure your placement and routing are OK.
A DDR memory device actually consists of two distinct components:
1: A series of memory arrays composed mostly of capacitors, which are written to and read from using a very wide bank of differential amplifiers. This is fundamentally an analogue circuit, surprisingly enough.
2: An interface buffer, which allows the hundreds or thousands of individual bits produced by a single memory-array read cycle to be interfaced to a reasonable number of data lines to the Northbridge or CPU. Several cycles on the external interface are needed to transmit the data in the buffer.
In general, the feature size of semiconductor technology decreases over time as manufacturing technology is refined. This has different effects in the above two components.
For the memory array, the differential amplifiers become more sensitive and the individual capacitors become smaller. This allows a larger array to be constructed in the same die area, reading out more bits per cycle. The speed of the array remains roughly the same, however.
For the interface buffer, some of the data paths become shorter and therefore faster, required voltage swings reduce, and there is now space for better skew-correction, clock recovery, etc. This permits higher external signalling speeds within a reasonable power and area budget. The original DDR RAM simply transmitted data on both the rising and falling edges of the clock signal, instead of only on the rising edge as SDRAM did. More recent versions effectively multiply the basic clock signal as well.
This "basic clock signal" usually works out to around 200MHz in mainstream products of each generation, though faster and slower devices are also available. In original DDR, a 200MHz clock meant 400 MT/s, and was often described as 400MHz (or DDR-400) though the highest frequency signal is actually 200MHz. In DDR2, the basic clock is doubled using a PLL at both ends of the interface, so the actual clock rate is 400MHz and there are 800 MT/s. In DDR3 the clock is quadrupled and in DDR4 it is octupled, giving typically 3200 MT/s today. As you can imagine, the timing relative to the clock edges has to be controlled very carefully.
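The arithmetic above can be sketched in a few lines; the 200 MHz base clock and the per-generation multipliers are the figures from the paragraph, applied to mainstream parts:

```python
# Illustration: how the base clock, the per-generation clock multiplier,
# and double data rate combine into the transfer rate (MT/s).
base_clock_mhz = 200  # common base clock for mainstream parts

multipliers = {
    "DDR":  1,  # I/O clock = base clock
    "DDR2": 2,  # I/O clock doubled by PLLs at both ends
    "DDR3": 4,  # quadrupled
    "DDR4": 8,  # octupled
}

for gen, mult in multipliers.items():
    io_clock_mhz = base_clock_mhz * mult
    transfers = io_clock_mhz * 2  # data on both rising and falling edges
    print(f"{gen}: I/O clock {io_clock_mhz} MHz -> {transfers} MT/s")
```

This reproduces the familiar marketing numbers: DDR-400, DDR2-800, DDR3-1600, DDR4-3200, each from the same ~200 MHz base clock.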
Since the memory arrays themselves haven't changed much in speed, these higher interface speeds come with increased CAS (column address strobe) latency figures, abbreviated CL. These describe how many transfer-clock cycles elapse between providing the address and receiving the data, and are used to accommodate the limited speed of the memory arrays relative to the interface bus.
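You can see this effect by converting CL from cycles to absolute time. The CL values below are representative figures I've chosen for illustration, not taken from any specific datasheet:

```python
# Sketch: converting CAS latency (CL, counted in I/O clock cycles)
# to absolute time in nanoseconds.
def cas_latency_ns(cl_cycles, transfer_rate_mts):
    io_clock_mhz = transfer_rate_mts / 2   # two transfers per I/O clock
    clock_period_ns = 1000 / io_clock_mhz
    return cl_cycles * clock_period_ns

# Representative (assumed) figures: DDR-400 at CL3 vs DDR4-3200 at CL22.
print(cas_latency_ns(3, 400))    # 15.0 ns
print(cas_latency_ns(22, 3200))  # 13.75 ns
```

The cycle count grows by a factor of seven, but the absolute latency barely moves, which is exactly what you'd expect if the arrays themselves haven't gotten much faster.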
One of the things that the basic clock controls more-or-less directly, rather than through a PLL, is the self-refresh cycle of the memory arrays. Using capacitors to store bits is very space-efficient, but the charge leaks out of them rather easily and weakens the indication within a few tens of milliseconds, so the memory arrays must constantly cycle through their contents, reading and re-writing them to ensure they remain valid.
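To give a sense of the refresh arithmetic: assuming the commonly quoted figures of a 64 ms retention window and 8192 refresh commands per window (typical JEDEC values, used here as assumptions), the controller ends up issuing a refresh roughly every 7.8 microseconds:

```python
# Refresh-interval arithmetic, using typical (assumed) JEDEC figures:
# the whole array must be refreshed within the retention window, spread
# across a fixed number of refresh commands.
retention_ms = 64          # assumed retention window
refresh_commands = 8192    # assumed refresh commands per window

trefi_us = retention_ms * 1000 / refresh_commands
print(f"average refresh interval: {trefi_us:.2f} us")  # 7.81 us
```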
Best Answer
The UDQS/LDQS strobe signals are strictly for timing; they are not optional: they must make a transition for every byte transferred and cannot be gated. Remember, data is transferred on BOTH edges of these strobes. The reason there is one per byte lane is to relax the constraints on PCB trace skew to just a byte at a time, rather than across all of the bits of a wide interface.
The UDM/LDM signals are mask signals whose timing is the same as the data itself — indeed, these signals are themselves clocked by UDQS/LDQS just like the data is.
When doing a burst transfer where only some of the bytes are being written, it wouldn't work to omit the UDQS/LDQS transitions for the unwanted bytes; the strobes must keep toggling on every beat, and it is the UDM/LDM mask signals that prevent the masked bytes from actually being written.
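The division of labour between the strobes and the masks can be shown with a toy model (this is purely illustrative Python, not any real controller API): an 8-beat burst write to a 16-bit-wide device, where the strobe conceptually fires on every beat while UDM/LDM decide whether each byte lane updates memory.

```python
# Toy model of a masked burst write to a x16 DDR device.
# The strobe toggles for EVERY beat; UDM/LDM (1 = masked) gate the
# upper and lower byte lanes independently.
def burst_write(memory, addr, beats, udm, ldm):
    """beats: list of 16-bit words; udm/ldm: per-beat mask bits."""
    for i, word in enumerate(beats):
        # A DQS transition occurs here on every beat, masked or not.
        old = memory[addr + i]
        hi = (old if udm[i] else word) & 0xFF00
        lo = (old if ldm[i] else word) & 0x00FF
        memory[addr + i] = hi | lo

mem = [0x0000] * 8
burst_write(mem, 0, [0x1122] * 8,
            udm=[0, 1, 0, 1, 0, 1, 0, 1],  # mask the upper byte on odd beats
            ldm=[0] * 8)                   # lower byte written on every beat
print([hex(w) for w in mem])  # ['0x1122', '0x22', '0x1122', '0x22', ...]
```

Note that the loop runs for all eight beats regardless of the masks, which is the point: the timing structure of the burst is fixed, and only the masks vary.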