- CPUs are not 'simple' by any stretch of the imagination. They have a few billion transistors, each of which has some small leakage at idle and has to charge and discharge the gate and interconnect capacitance of other transistors when switching. Yes, each one draws a small current, but when you multiply that by the number of transistors, you end up with a surprisingly large total. 64 A is already an average current (e.g. a ~77 W TDP at 1.2 V works out to about 64 A); while switching, the transistors can draw a lot more than the average, and this is smoothed out by bypass capacitors. Remember that your 64 A figure came from working backwards from the TDP, making that really 64 A RMS, and there can be significant variation around that at many time scales (variation during a clock cycle, variation during different operations, variation between sleep states, etc.). Also, you might be able to get away with running a CPU designed to operate at 3 GHz on 1.2 volts and 64 amps at 1 volt and 1 amp... just maybe at 3 MHz. At that point you would also have to worry about whether the chip uses dynamic logic that has a minimum clock frequency, so you might instead have to run it at a few hundred MHz to a GHz and cycle it into deep sleep periodically to get the average current down. The bottom line is that power = performance; the performance of most modern CPUs is actually thermally limited.
- This is relatively easy to calculate - \$I = C v \alpha f\$, where \$I\$ is the current, \$C\$ is the load capacitance, \$v\$ is the voltage, \$\alpha\$ is the activity factor, and \$f\$ is the switching frequency. I'll see if I can get ballpark numbers for a FinFET's gate capacitance and edit.
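(For context, the formula follows from the charge moved per switching event: each event moves a charge \$Q = C v\$ onto or off the gate capacitance, and a given transistor switches \$\alpha f\$ times per second on average, so the average current per transistor is \$I = Q \alpha f = C v \alpha f\$.)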
- Sort of. The faster the gate capacitance is charged or discharged, the faster the transistor will switch. Charging it faster requires either a smaller capacitance (determined by geometry) or a larger current (determined by interconnect resistance and supply voltage). Transistors that switch faster can then switch more often, which results in more average current draw (proportional to clock frequency).
Edit: so, http://www.synopsys.com/community/universityprogram/documents/article-iitk/25nmtriplegatefinfetswithraisedsourcedrain.pdf has a figure for the gate capacitance of a 25 nm FinFET. I'm just going to call it 0.1 fF for the sake of keeping things simple. It apparently varies with bias voltage, and it will certainly vary with transistor size (transistors are sized according to their purpose in the circuit; not all of the transistors will be the same size! Larger transistors are 'stronger' in that they can switch more current, but they also have higher gate capacitance and require more current to drive).
Plugging in 1.25 volts, 0.1 fF, 3 GHz, and \$\alpha = 1\$, the result is \$0.375\ \mu A\$ per transistor. Multiply that by 1 billion and you get 375 A. That's the required average gate current (charge per second into the gate capacitance) to switch 1 billion of these transistors at 3 GHz. That doesn't count 'shoot-through,' which will occur during switching in CMOS logic. It's also an average, so the instantaneous current could vary a lot - think of how the current draw asymptotically decreases as an RC circuit charges up. Bypass capacitors on the substrate, package, and circuit board will smooth out this variation. Obviously this is just a ballpark figure, but it seems to be the right order of magnitude. This also does not consider leakage current or charge stored in other parasitics (i.e. wiring).
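To make that arithmetic easy to check, here's a quick C sketch of the same estimate (my own illustration; the 0.1 fF, 1.25 V, 3 GHz, and 1-billion-transistor figures are the assumptions from above):

#include <stdio.h>

int main(void) {
    /* Ballpark dynamic-current estimate: I = C * v * alpha * f */
    double C     = 0.1e-15; /* gate capacitance per transistor, 0.1 fF (assumed) */
    double v     = 1.25;    /* supply voltage in volts */
    double f     = 3e9;     /* switching frequency, 3 GHz */
    double alpha = 1.0;     /* activity factor: every transistor switches each cycle */
    double n     = 1e9;     /* number of transistors */

    double i_per_transistor = C * v * alpha * f; /* amps per transistor */
    double i_total          = i_per_transistor * n;

    printf("Per transistor: %.3f uA\n", i_per_transistor * 1e6); /* 0.375 uA */
    printf("Total:          %.0f A\n", i_total);                 /* 375 A */
    return 0;
}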
In most devices, \$\alpha\$ will be much less than 1, as many of the transistors are idle on each clock cycle. How much less will vary depending on the function of the transistors. For example, transistors in the clock distribution network have \$\alpha = 1\$, as they switch twice on every clock cycle. For something like a binary counter, the LSB has \$\alpha = 0.5\$ as it switches once per clock cycle, the next bit has \$\alpha = 0.25\$ as it switches half as often, etc. For something like a cache memory, however, \$\alpha\$ can be very small. Take a 1 MB cache, for example. Built with 6T SRAM cells, it needs about 50 million transistors (\$2^{20}\$ bytes \$\times\$ 8 bits \$\times\$ 6 transistors) just to store the data, and it will have more for the read and write logic, demultiplexers, etc. However, only a handful would ever switch on a given clock cycle. Let's say the cache line is 128 bytes, and a new line is written on every cycle. That's 1024 bits. Assuming the cell contents and the new data are both random, 512 bits are expected to be flipped. That's 3072 transistors out of roughly 50 million, or \$\alpha \approx 0.000061\$. Note that this is only for the memory array itself; the support circuitry (decoders, read/write logic, sense amps, etc.) will have a much larger \$\alpha\$. This is why cache memory power consumption is usually dominated by leakage current - that is a LOT of idle transistors just sitting around leaking instead of switching.
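The cache arithmetic is simple enough to spell out in a few lines of C (again my own sketch, just reproducing the numbers above):

#include <stdio.h>

int main(void) {
    /* Activity factor for a 1 MB 6T SRAM array, one 128-byte line written per cycle */
    double bits        = (double)(1 << 20) * 8; /* 1 MB = 8,388,608 bits */
    double transistors = bits * 6;              /* 6T per cell: ~50 million transistors */
    double line_bits   = 128 * 8;               /* 1024 bits per cache line */
    double flipped     = line_bits / 2;         /* random data: ~512 bits expected to flip */
    double switching   = flipped * 6;           /* 3072 transistors actually switch */

    printf("alpha = %.6f\n", switching / transistors); /* prints alpha = 0.000061 */
    return 0;
}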
Maybe this makes sense by way of an example. I don't have a PIC compiler handy, but it doesn't matter with respect to the question.
Take this code:
unsigned int xyz=5;
unsigned int fun ( unsigned int x )
{
return(x+xyz);
}
Compile it, assemble, and link, then disassemble to show the result:
Disassembly of section .text:
00000000 <_start>:
0: e3a0d902 mov sp, #32768 ; 0x8000
4: eb000000 bl c <fun>
8: eafffffe b 8 <_start+0x8>
0000000c <fun>:
c: e59f3008 ldr r3, [pc, #8] ; 1c <fun+0x10>
10: e5933000 ldr r3, [r3]
14: e0800003 add r0, r0, r3
18: e12fff1e bx lr
1c: 20000000
Disassembly of section .data:
20000000 <xyz>:
20000000: 00000005
So we have a global variable xyz. This is data: it goes in RAM, we can read and write it, change its value, whatever - it has to be in read/write RAM.
The program, which is the machine code, occupies addresses 0x00000000 through 0x0000001F above.
So program memory, also known as .text, holds the program: the instructions. Data memory, also known as .data, holds the read/write variables, in particular ones like this that are pre-initialized to some value. Ones that are not initialized to some value fall into the .bss segment, but they are still in RAM since they are read/write variables.
When the microcontroller is off, RAM is off; it doesn't work, it cannot store values. EEPROM, flash, and other forms of non-volatile memory are used instead. The processor/hardware is designed to know how to start using that memory, and programs are designed to operate from it. You can see we have an issue: we need to remember that the variable xyz starts with the value 5, but RAM is volatile. So we need to have in non-volatile memory both the program itself and any other items we need to know. All of .text lives there, and in the case of a microcontroller the program is generally executed from on-chip flash and is accessed by the microcontroller's processor at the addresses we have linked.

As for the .data values, I have not shown the bootstrap magic here, but the bootstrap that runs before the C entry point function (usually main, but that is arbitrary) has a few jobs at minimum. First, it sets the stack pointer so we have a stack; normally the processor core does not know how much RAM there is, so the software does it. Second, it copies any .data values to RAM so that the C program can find them; in this case the value 0x00000005 will be somewhere in flash, and we know we have to copy that value to address 0x20000000 in RAM. Third (not necessarily in this order), there will be address and size information for .bss. I don't have any .bss variables in this example, but if there were, we would have a starting address and a number of bytes, and bootstrap code to zero that region out, since programmers sometimes assume that a variable without a value assigned at compile time will be zero if they read it before they write it. A sketch of that bootstrap is below.
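Something like this, where __data_load, __data_start, __data_end, __bss_start, and __bss_end are placeholder symbol names that a real linker script would define (a minimal sketch, not any particular toolchain's actual startup code):

/* Symbols the linker script would provide (names here are illustrative) */
extern unsigned int __data_load;  /* where .data's initial values sit in flash */
extern unsigned int __data_start; /* where .data must live in RAM */
extern unsigned int __data_end;
extern unsigned int __bss_start;
extern unsigned int __bss_end;

void bootstrap(void)
{
    unsigned int *src = &__data_load;
    unsigned int *dst = &__data_start;

    /* Copy the initial values of .data from flash into RAM
       (this is how xyz comes to hold 5 at address 0x20000000) */
    while (dst < &__data_end)
        *dst++ = *src++;

    /* Zero .bss so uninitialized variables read as 0 */
    for (dst = &__bss_start; dst < &__bss_end; dst++)
        *dst = 0;

    /* Setting the stack pointer happens even earlier, in assembly,
       before any C code like this can run */
}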
So data memory is the RAM; I assume from your output that it is telling you how many bytes of variables the compiler has detected your program needs. The program memory is the program itself, the machine code. Flash, which you didn't ask about, will hold the program memory plus the information about the data memory so your program can work. And then there is EEPROM, another form of non-volatile memory where you can save some items; it is often accessed through a library or code you write, but not normally accessed as a simple variable name nor executed as a program (in a microcontroller).

Say your device is an mp3 player and the music is also in the flash, perhaps in the form of a file system with files. You might want to add the feature that when the device is turned off, the last file being played and the position within that file are saved, so that when the power comes back it can continue. You could store this information in RAM and have the device's battery keep the RAM alive, or you could store it in flash if there is space, or you could store it in EEPROM if your device has one. You would often use EEPROM for runtime information that you want to preserve through a power cycle. Like the odometer reading on a car: even if the battery is disconnected, you want to remember the mileage so that when the car is up and running again you can continue from that value and not start from zero every time. Or remembering the user's radio station preferences for the shortcut buttons. A sketch of the mp3-player idea follows.
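It might look something like this; note that eeprom_read_word() and eeprom_write_word() are hypothetical driver functions standing in for whatever library your particular part actually provides, and the two addresses are an arbitrary layout chosen for illustration:

/* Hypothetical EEPROM driver calls; a real microcontroller supplies its own */
extern void eeprom_write_word(unsigned int address, unsigned int value);
extern unsigned int eeprom_read_word(unsigned int address);

#define EEPROM_LAST_FILE   0x00 /* arbitrary layout chosen for this example */
#define EEPROM_LAST_OFFSET 0x04

/* Called when the device is about to power down */
void save_playback_state(unsigned int file_id, unsigned int byte_offset)
{
    eeprom_write_word(EEPROM_LAST_FILE, file_id);
    eeprom_write_word(EEPROM_LAST_OFFSET, byte_offset);
}

/* Called at startup to resume where we left off */
void restore_playback_state(unsigned int *file_id, unsigned int *byte_offset)
{
    *file_id     = eeprom_read_word(EEPROM_LAST_FILE);
    *byte_offset = eeprom_read_word(EEPROM_LAST_OFFSET);
}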
Intel's Haswell (or at least those products that incorporate the Iris Pro 5200 GPU) and IBM's POWER7 and POWER8 all include embedded DRAM, "eDRAM".
One important reason eDRAM has been uncommon until recently is that the DRAM fabrication process is not inherently compatible with logic processes, so extra steps must be included (which increase cost and decrease yield) when eDRAM is desired. There must therefore be a compelling reason for incorporating it in order to offset this economic disadvantage. Alternatively, the DRAM can be placed on a separate die that is manufactured independently of, but then integrated onto the same package as, the CPU. This provides most of the benefits of locality without the difficulties of manufacturing the two in a truly integrated way.
Another problem is that, unlike SRAM, DRAM does not store its contents indefinitely while power is applied, and reading it destroys the stored data, which must be written back afterwards. Hence, it has to be refreshed periodically and after every read. And, because a DRAM cell is based on a capacitor, charging or discharging it sufficiently that leakage will not corrupt its value before the next refresh takes some finite amount of time. This charging time is not required with SRAM, which is just a latch; consequently SRAM can be clocked at the same rate as the CPU, whereas DRAM is limited to about 1 GHz while maintaining reasonable power consumption. This gives DRAM a higher inherent latency than SRAM, which makes it not worthwhile to use for all but the very largest caches, where the reduced miss rate will pay off. (Haswell and POWER8 are roughly contemporaneous, and both incorporate up to 128 MB of eDRAM, which is used as an L4 cache.)
Also, as far as latency is concerned, a large part of the difficulty is the physical distance signals must travel. Light can only travel 10 cm in the clock period of a 3 GHz CPU. Of course, signals do not travel in straight lines across the die, nor do they propagate at anything close to the speed of light, due to the need for buffering and fan-out, which incur propagation delays. So, the maximum distance a memory can be from a CPU while maintaining 1 clock cycle of latency is a few centimetres at most, limiting the amount of memory that can be accommodated in the available area. Intel's Nehalem processor actually reduced the capacity of the L2 cache versus Penryn partly to improve its latency, which led to higher performance.* If we do not care so much about latency, then there is no reason to put the memory on-package rather than further away where it is more convenient.
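(As a quick check of that 10 cm figure: at \$f = 3\$ GHz one clock period is \$1/f \approx 333\$ ps, and \$d = c/f = (3 \times 10^{8}\ \mathrm{m/s})/(3 \times 10^{9}\ \mathrm{Hz}) = 0.1\ \mathrm{m}\$.)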
It should also be noted that the cache hit rate is very high for most workloads: well above 90% in almost all practical cases, and not uncommonly even above 99%. So, the benefit of including larger memories on-die is inherently limited to reducing the impact of this few percent of misses. Processors intended for the enterprise server market (such as POWER) typically have enormous caches and can profitably include eDRAM because it is useful to accommodate the large working sets of many enterprise workloads. Haswell has it to support the GPU, because textures are large and cannot be accommodated in cache. These are the use cases for eDRAM today, not typical desktop or HPC workloads, which are very well served by the typical cache hierarchies.
To address some issues raised in comments:
These eDRAM caches cannot be used in place of main memory because they are designed as L4 victim caches. This means they are volatile and effectively content-addressable, so that data stored in them is not treated as residing in any specific location, and may be discarded at any time. These properties are difficult to reconcile with the requirement that RAM be directly mapped and persistent, but changing them would make the caches useless for their intended purpose. It is of course possible to embed memories of a more conventional design, as is done in microcontrollers, but this is not justifiable for systems with large memories, since low latency is not as beneficial in main memory as it is in a cache, so enlarging or adding a cache is the more worthwhile proposition.
As to the possibility of very large caches with capacity on the order of gigabytes: a cache only needs to be about the size of the application's working set to be effective. HPC applications may deal with terabyte datasets, but they have good temporal and spatial locality, so their working sets are typically not very large. Applications with large working sets include databases and ERP software, but there is only a limited market for processors optimized for this sort of workload. Unless the software truly needs it, adding more cache provides very rapidly diminishing returns. Recently, processors have gained prefetch instructions, so caches can be used more efficiently: one can use these instructions to avoid misses caused by the unpredictability of memory access patterns, rather than by the absolute size of the working set, which in most cases is still relatively small.
*The improvement in latency was not due only to the smaller physical size of the cache, but also because the associativity was reduced. There were significant changes to the entire cache hierarchy in Nehalem for several different reasons, not all of which were focused on improving performance. So, while this suffices as an example, it is not a complete account.