Pointer Access and Cache Coherency in Programming


To my understanding, when you access a variable, that variable and the surrounding area of memory are pulled into the L1 cache. If I'm wrong here, please tell me.

Now my question is: say I have an array of pointers, and I want to iterate through them all, performing operation X on each. If I first have to access the pointer to get the address of the actual data, does this mean that the memory near the pointer will be put into cache, then the memory near the data, then back to the pointer, then to the data, and so on? i.e. cache thrashing?
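For concreteness, here is roughly the pattern I mean (the `Widget` type and `doX` operation are just placeholders):

```cpp
#include <cstddef>

struct Widget { int value; };       // placeholder data type
void doX(Widget& w) { w.value++; }  // placeholder for "operation X"

void iterate(Widget* widgets[], std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        // First touch: the pointer itself, in the widgets array.
        // Second touch: the pointed-to Widget, which may sit on a
        // completely different cache line.
        doX(*widgets[i]);
    }
}
```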

If this is the case, how might one go about maintaining cache coherency?

Best Answer

Whether or not the data will evict the pointer from cache depends on how much memory you actually touch, how memory addresses are mapped to cache lines, and on the replacement policy. The most common replacement policy is Least Recently Used (LRU), or often just an approximation of it, where new data replaces whichever cache line in the set was used least recently. Memory addresses are commonly mapped in N-way sets: multiple memory addresses map to the same set, and each address can occupy any of the N lines within its set (see here). N will probably be 2 or 4.
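As a rough sketch of how that mapping works (assuming 64-byte lines and a power-of-two number of sets, which is typical but not guaranteed; the geometry numbers here are made up):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>

int main() {
    // Hypothetical geometry: 32 KiB cache, 4-way set associative,
    // 64-byte lines -> 32768 / (4 * 64) = 128 sets.
    const std::uintptr_t line_size = 64;
    const std::uintptr_t num_sets  = 128;

    int value = 42;
    std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(&value);

    // Low bits pick the byte within the line, the next bits pick the
    // set; an address can then occupy any of the 4 ways in its set.
    std::uintptr_t set = (addr / line_size) % num_sets;

    std::printf("%p maps to set %zu\n", (void*)addr, (std::size_t)set);
}
```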

Under these conditions, you will almost assuredly not have cache thrashing when you dereference a pointer and then use the data, provided the data you touch after dereferencing fits in a single cache line. For both accesses, the least recently used line is definitely not the line holding the pointer (in the case of the data access) or the data (in the case of the pointer dereference), since those were the most recently touched lines.

If you touch more memory than one cache line's worth after dereferencing the pointer, it is possible you will evict the pointer's line from cache, but that depends on where the data sits in memory and how much of it you touch. Further data accesses may still leave the pointers alone, either because those addresses map to different sets or because the pointer line is not the least recently used line in its set. Since the addresses your allocator hands out are usually not known until the program runs, it is hard to say either way just from source code.
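A toy simulation of one 2-way LRU set shows why the alternating pointer/data pattern doesn't thrash (this is a model for illustration, not any real processor's policy):

```cpp
#include <cstdio>

// One 2-way set with LRU replacement: alternating accesses to two
// lines (pointer line P, data line D) never miss after the first
// touch, because the eviction victim is always the least recently
// used line -- which is never the one we need next.
int main() {
    int way[2] = {-1, -1};  // which line each way holds (-1 = empty)
    int lru = 0;            // index of the least recently used way

    const int P = 100, D = 200;  // line IDs for pointer and data
    const int accesses[] = {P, D, P, D, P, D};

    for (int line : accesses) {
        if (way[0] == line)      { lru = 1; std::printf("hit  %d\n", line); }
        else if (way[1] == line) { lru = 0; std::printf("hit  %d\n", line); }
        else {
            way[lru] = line;     // evict the LRU way
            std::printf("miss %d (filled way %d)\n", line, lru);
            lru = 1 - lru;       // the other way is now least recent
        }
    }
}
```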

It is probably possible to have your data and pointers allocated at specific memory addresses to keep hitting cache as much as possible, but that would require knowing the cache policies of the processor running your code, and that the compiler doesn't make optimizations that change the data access patterns without you knowing. It would also be processor specific: different cache policies on different processors make any micro-optimization at this scale worthless. As I note below, it is very difficult to find information about the specific policies of common processors, so it is probably not worth pursuing this kind of optimization at all. In general, consecutive memory accesses are the best way of hitting cache if you are only touching things once (as in your loop over the pointers). No other optimization is needed.
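A minimal sketch of that advice, with illustrative names: storing the objects themselves contiguously gives you those consecutive addresses, while an array of pointers to separately allocated objects generally does not.

```cpp
#include <vector>

struct Item { int value; };  // illustrative type

// Pointer-chasing: each dereference may land on a far-away cache
// line, wherever the allocator happened to place that Item.
long sum_indirect(const std::vector<Item*>& items) {
    long total = 0;
    for (const Item* p : items)
        total += p->value;
    return total;
}

// Contiguous: neighbouring elements share cache lines, so spatial
// locality (and the hardware prefetcher) work in your favour.
long sum_contiguous(const std::vector<Item>& items) {
    long total = 0;
    for (const Item& it : items)
        total += it.value;
    return total;
}
```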

This is all assuming a single processor with a single thread. Multiple threads and multiple processors throw wrenches into reasoning about cache. Multiple threads on a single processor mean that another thread can start running in the middle of your loop, and by the time your thread runs again the cache is in a completely different state and the whole process starts over; otherwise, not much is different.

If you have a multi-threaded loop (think one thread per pointer) running on multiple processors, then the cache becomes quite difficult to reason about. The order of execution can be rearranged, and consecutive pointers can be dereferenced on different processors. Either way, the cache has probably not been primed by previous pointer dereferences or data accesses. In this case you should consider chunking the work so each thread accesses a range of pointers instead of a single one, so the cache can actually be used.
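A minimal sketch of that chunking, assuming C++ threads and an illustrative `process` function standing in for operation X:

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

struct Item { int value; };
void process(Item& item) { item.value *= 2; }  // stand-in for operation X

// Each thread walks a contiguous range of the pointer array, so its
// consecutive pointer reads share cache lines instead of competing
// with other processors for them.
void process_all(std::vector<Item*>& items, unsigned num_threads) {
    std::vector<std::thread> workers;
    const std::size_t chunk = (items.size() + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(items.size(), t * chunk);
        const std::size_t end   = std::min(items.size(), begin + chunk);
        workers.emplace_back([&items, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                process(*items[i]);
        });
    }
    for (auto& w : workers) w.join();
}
```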

Remember that the memory you touch after dereferencing has to be SMALL. A few ints fit comfortably in a cache line (a common line size is 64 bytes), but an object with many members might not. If you access data stored more than a cache line's size from the beginning of the object, you pull more lines into cache and thus have more chance of evicting the pointers.
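For example, under the common assumption of 64-byte lines, a quick check of struct sizes tells you whether a single object can even fit in one line:

```cpp
#include <cstdio>

struct Small { int a, b; };          // 8 bytes: fits easily in a 64-byte line
struct Big   { double values[16]; }; // 128 bytes: spans at least two lines

int main() {
    std::printf("sizeof(Small) = %zu\n", sizeof(Small));
    std::printf("sizeof(Big)   = %zu\n", sizeof(Big));
}
```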

Finding data about the caches in Intel and AMD processors is kind of difficult, but I found this link from 2010 saying that Intel uses a pseudo-LRU policy on a 4-way set-associative L2 cache, and the same policy for the L1 cache (no set associativity is mentioned), on at least one of their processor architectures. It should be noted that the replacement policy is an integral part of what makes caches speed up programs. I would not be surprised in the least if real policies are more complicated or harder to reason about than what I presented above, since unmodified LRU is not universally the best replacement policy. Actual replacement policies probably have much more complicated rules in the name of keeping cache hit rates high, but exact details will be scarce, since they are a competitive edge in the performance-benchmark wars.