C++ – How many bytes does a Xeon bring into the cache per memory access

ccachingmemoryperformance

I am working on a system, written in C++, running on a Xeon on Linux, that needs to run as fast as possible. There is a large data structure (basically an array of structs) held in RAM, over 10 GB, and elements of it need to be accessed periodically. I want to revise the data structure to work with the system's caching mechanism as much as possible.

Currently, accesses are done mostly randomly across the structure, and each time 1-4 32-bit ints are read. It is a long time before another read occurs in the same place, so there is no benefit from the cache.

Now I know that when you read a byte from a random location in RAM, more than just that byte is brought into the cache. My question is how many bytes are brought in? Is it 16, 32, 64, 4096? Is this called a cache line?

I am looking to redesign the data structure to minimize random RAM accesses and work with the cache instead of against it. Knowing how many bytes are pulled into the cache on a random access will inform the design choices I make.

Update (October 2014):
Shortly after I posed the question above the project was put on hold. It has since resumed and based on suggestions in answers below, I performed some experiments around RAM access, because it seemed likely that TLB thrash was happening. I revised the program to run with huge pages (2MB instead of the standard 4KB), and observed a small speedup, about 2.5%. I found great information about setting up for huge pages here and here.

Best Answer

Today’s CPUs fetch memory in chunks of (typically) 64 bytes, called cache lines. When you read a particular memory location, the entire cache line is fetched from the main memory into the cache.

More here : http://igoro.com/archive/gallery-of-processor-cache-effects/

Related Solutions

Linux – How to measure the actual memory usage of an application or process

With ps or similar tools you will only get the amount of memory pages allocated by that process. This number is correct, but:

does not reflect the actual amount of memory used by the application, only the amount of memory reserved for it
can be misleading if pages are shared, for example by several threads or by using dynamically linked libraries

If you really want to know what amount of memory your application actually uses, you need to run it within a profiler. For example, Valgrind can give you insights about the amount of memory used, and, more importantly, about possible memory leaks in your program. The heap profiler tool of Valgrind is called 'massif':

Massif is a heap profiler. It performs detailed heap profiling by taking regular snapshots of a program's heap. It produces a graph showing heap usage over time, including information about which parts of the program are responsible for the most memory allocations. The graph is supplemented by a text or HTML file that includes more information for determining where the most memory is being allocated. Massif runs programs about 20x slower than normal.

As explained in the Valgrind documentation, you need to run the program through Valgrind:

valgrind --tool=massif <executable> <arguments>

Massif writes a dump of memory usage snapshots (e.g. massif.out.12345). These provide, (1) a timeline of memory usage, (2) for each snapshot, a record of where in your program memory was allocated. A great graphical tool for analyzing these files is massif-visualizer. But I found ms_print, a simple text-based tool shipped with Valgrind, to be of great help already.

To find memory leaks, use the (default) memcheck tool of valgrind.

How do cache lines work

If the cache line containing the byte or word you're loading is not already present in the cache, your CPU will request the 64 bytes that begin at the cache line boundary (the largest address below the one you need that is multiple of 64).

Modern PC memory modules transfer 64 bits (8 bytes) at a time, in a burst of eight transfers, so one command triggers a read or write of a full cache line from memory. (DDR1/2/3/4 SDRAM burst transfer size is configurable up to 64B; CPUs will select the burst transfer size to match their cache line size, but 64B is common)

As a rule of thumb, if the processor can't forecast a memory access (and prefetch it), the retrieval process can take ~90 nanoseconds, or ~250 clock cycles (from the CPU knowing the address to the CPU receiving data).

By contrast, a hit in L1 cache has a load-use latency of 3 or 4 cycles, and a store-reload has a store-forwarding latency of 4 or 5 cycles on modern x86 CPUs. Things are similar on other architectures.

Further reading: Ulrich Drepper's What Every Programmer Should Know About Memory. The software-prefetch advice is a bit outdated: modern HW prefetchers are smarter, and hyperthreading is way better than in P4 days (so a prefetch thread is typically a waste). Also, the x86 tag wiki has lots of performance links for that architecture.

Best Answer

Related Solutions

Linux – How to measure the actual memory usage of an application or process

How do cache lines work

Related Topic