Architecture – Understanding memory update propagation in x86/x86-64 CPU L1/L2/L3 caches and RAM

Tags: architecture, caching, cpu, memory

I'm trying to understand, in a general sense, how L1/L2 (and now L3) caches are updated and how the updates are propagated in a multi-core x86/x86-64 CPU.

Assume a 4-core CPU and 2 pairs of L1/L2 caches, where each pair of cores shares a common L1/L2 pair and there's an interconnect between the 2 L1/L2 pairs. Also assume the cache lines are 64 bits wide.
So we have:

  1. Core-0/Core-1 on (L1/L2)-0
  2. Core-2/Core-3 on (L1/L2)-1
  3. (L1/L2)-0 is connected to (L1/L2)-1

Let's say there is a thread T0 running on Core-0 that writes to a 64-bit integer variable called X, and another thread T1 on Core-3 that continually reads X's value. (Please ignore the logical race conditions for a moment.)

Question:
Assuming X has been cached in Core-0's L1, when T0 writes a new value to X, is the following sequence of events correct?

  1. X's value is pushed from register to L1-0
  2. X's value is pushed from L1-0 to L2-0
  3. X's value is pushed from L2-0 to L2-1
  4. X's value is pushed from L2-1 to L1-1
  5. X's value is pushed from L1-0 to RAM

Note: Steps 2 and 4 may happen concurrently.
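
To make the setup concrete, here is a minimal sketch of what I mean in C++ (my own illustration; std::atomic is used only so the cross-thread access is well-defined, and pinning T0/T1 to specific cores is OS-specific, so it's omitted):

    #include <atomic>
    #include <cstdint>
    #include <thread>

    // The shared 64-bit variable X from the question. The cache traffic
    // generated by the store and the repeated loads is what I'm asking about.
    std::atomic<std::int64_t> X{0};

    void t0_writer() {              // imagine this running on Core-0
        for (std::int64_t i = 1; i <= 1000; ++i)
            X.store(i, std::memory_order_release);
    }

    void t1_reader() {              // imagine this running on Core-3
        std::int64_t seen = 0;
        while (seen < 1000)         // spins until the last write is visible
            seen = X.load(std::memory_order_acquire);
    }

    int main() {
        std::thread t0(t0_writer), t1(t1_reader);
        t0.join();
        t1.join();
    }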

Best Answer

First, if you are concerned with recent (last 5-10 years, since Nehalem?) Intel x86 architectures, then you're a little off in your description of the caches.

Each core has its own 64K of L1 cache, split 32K data / 32K instruction. Above that, each core has its own L2 cache (256K), which basically acts as a buffer between the L1 and the L3. Each socket has its own L3 cache (up to 20MB, I think). The L1 and L2 caches are small, simple, and quick; the L3 cache is much larger and more complex (it is divided into slices that all sit on a ring bus along with other agents, such as the memory controller and the QPI bridge to other sockets). (Please ignore things like the load and store buffers, which would make this even more complex.)

Also, cache lines are 64 bytes wide, not 64 bits. (You can query all of this on your own machine; see the sketch after the diagram below.)

So it looks like this (on, for example, a dual-socket machine with 4 cores per CPU):

Core 0 (socket 0) -> L1-0 -> L2-0 -> L3-0 -> QPI link to socket 1
Core 1 (socket 0) -> L1-1 -> L2-1 -> ^
Core 2 (socket 0) -> L1-2 -> L2-2 -> |
Core 3 (socket 0) -> L1-3 -> L2-3 -> +

Core 4 (socket 1) -> L1-4 -> L2-4 -> L3-1 -> QPI link to socket 0
Core 5 (socket 1) -> L1-5 -> L2-5 -> ^
Core 6 (socket 1) -> L1-6 -> L2-6 -> |
Core 7 (socket 1) -> L1-7 -> L2-7 -> +
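
If you want to check the exact geometry of your own machine rather than take my numbers from memory, Linux exposes it through sysfs. A quick sketch, assuming the usual /sys/devices/system/cpu layout:

    #include <fstream>
    #include <iostream>
    #include <string>

    // Sketch: print the cache hierarchy the kernel reports for cpu0.
    // Linux-only; the index count and exact files vary by kernel and machine.
    static std::string read_line(const std::string& path) {
        std::ifstream f(path);
        std::string s;
        std::getline(f, s);
        return s;
    }

    int main() {
        const std::string base = "/sys/devices/system/cpu/cpu0/cache/index";
        for (int i = 0; ; ++i) {
            const std::string dir = base + std::to_string(i) + "/";
            if (!std::ifstream(dir + "level")) break;   // no more cache indices
            std::cout << "L" << read_line(dir + "level")
                      << " " << read_line(dir + "type")
                      << ", size " << read_line(dir + "size")
                      << ", line " << read_line(dir + "coherency_line_size") << "B"
                      << ", shared by CPUs " << read_line(dir + "shared_cpu_list")
                      << "\n";
        }
    }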

Pre-Nehalem, L2 caches were shared among core pairs. I haven't had to do a lot of performance work on that generation in a while, so I'm not really sure of the subtleties there.

The L3 cache is fully inclusive of the L1 and L2 caches below it, and it contains the "correct" values for all cached memory addresses. In fact, they are more correct than main memory, since writes can sit in L3 for a while before going out to memory (write-back caching). All caches are coherent; that is, you will never see two different values for the same memory location. This coherency is maintained by a variant of the MESI protocol called MESIF (that's Intel; AMD uses MOESI and arranges its caches differently).

Since the L1 and L2 are private to each core, coherence only has to be managed at the L3 level (I think; I've been unable to get a definitive answer on this). The cache interconnects have four lanes: data, request, acknowledge, and snoop (the snoop lane keeps each cache up to date on the others' memory operations).
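
As a rough mental model of the states involved, here is a toy sketch in C++ of how one cache's copy of a line reacts to snoop traffic. This is my simplification, not Intel's actual transition tables (which, as far as I know, are not fully public):

    // Toy model of the MESIF line states, as seen from one cache observing
    // snoop traffic. A sketch only: the real protocol is more involved.
    enum class LineState { Modified, Exclusive, Shared, Invalid, Forward };

    // Another cache wants to read a line we hold.
    LineState on_remote_read(LineState s) {
        switch (s) {
            case LineState::Modified:    // we supply the dirty data...
            case LineState::Exclusive:   // ...or our clean copy...
            case LineState::Forward:     // ...and drop to Shared; the
                return LineState::Shared;//    requester takes the Forward role
            default:
                return s;                // Shared/Invalid are unchanged
        }
    }

    // Another cache issued a Read-For-Ownership (it intends to write).
    LineState on_remote_rfo(LineState) {
        return LineState::Invalid;       // every other copy gets invalidated
    }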

Now, we can get down to your questions.

If a thread on Core-0 is reading an address, the line will reside in L1-0 and L3-0 in either the Exclusive, Shared, or Forward state (all three indicate that the line is unmodified and cached). Now Core-4 wants to write to it. A Read-For-Ownership request will fetch the cache line from the other socket's L3 cache (L3-0) and cause the other caches to mark their copies as Invalid. The line will now be in L1-4 and L3-1, marked as Exclusive.

(Here is where ignoring the store buffers simplifies things a lot.)

Core-4 will write the new value from a register to the L1-4 cache, causing the line to transition to the Modified state. This gets propagated to the L3-1 cache (since the L3 is fully inclusive).

Now Core-0 wants to read the address again. The line in L1-0 is Invalid, so Core-0 sends a read request that misses in L3-0 and causes L3-1 to send the cache line back across the interconnect. Afterwards, L3-1 holds the line as Shared, and L3-0 holds it as Forward (the most recent requester takes over the forwarding responsibility).
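
Incidentally, this ownership ping-pong is exactly what you pay for with false sharing, and you can observe it from software. Here is a sketch (my own illustration, not tied to any particular machine): two counters on the same 64-byte line versus counters padded onto separate lines. On most multi-core boxes the shared-line version runs noticeably slower:

    #include <atomic>
    #include <chrono>
    #include <iostream>
    #include <thread>

    // Sketch: the ownership ping-pong above, seen as wall-clock time.
    // Two threads increment their own counters. On the same 64-byte line,
    // every write issues a Read-For-Ownership that invalidates the other
    // core's copy; padded onto separate lines, that traffic disappears.
    // Thread placement is left to the OS, so the numbers are only indicative.
    struct SameLine  { std::atomic<long> a{0}, b{0}; };           // one line
    struct SplitLine { alignas(64) std::atomic<long> a{0};
                       alignas(64) std::atomic<long> b{0}; };     // two lines

    template <typename Counters>
    long long time_ms(Counters& c) {
        auto start = std::chrono::steady_clock::now();
        std::thread t0([&] { for (int i = 0; i < 20000000; ++i)
                                 c.a.fetch_add(1, std::memory_order_relaxed); });
        std::thread t1([&] { for (int i = 0; i < 20000000; ++i)
                                 c.b.fetch_add(1, std::memory_order_relaxed); });
        t0.join(); t1.join();
        auto stop = std::chrono::steady_clock::now();
        return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
    }

    int main() {
        SameLine same;
        SplitLine split;
        std::cout << "same line:  " << time_ms(same)  << " ms\n"
                  << "split line: " << time_ms(split) << " ms\n";
    }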

Clear as mud? There may be a few edits to this later to clean up language that I've been too vague on.