C++ – How many bytes does a Xeon bring into the cache per memory access

ccachingmemoryperformance

I am working on a system, written in C++, running on a Xeon on Linux, that needs to run as fast as possible. There is a large data structure (basically an array of structs) held in RAM, over 10 GB, and elements of it need to be accessed periodically. I want to revise the data structure to work with the system's caching mechanism as much as possible.

Currently, accesses are done mostly randomly across the structure, and each time 1-4 32-bit ints are read. It is a long time before another read occurs in the same place, so there is no benefit from the cache.

Now I know that when you read a byte from a random location in RAM, more than just that byte is brought into the cache. My question is how many bytes are brought in? Is it 16, 32, 64, 4096? Is this called a cache line?

I am looking to redesign the data structure to minimize random RAM accesses and work with the cache instead of against it. Knowing how many bytes are pulled into the cache on a random access will inform the design choices I make.

Update (October 2014):
Shortly after I posed the question above the project was put on hold. It has since resumed and based on suggestions in answers below, I performed some experiments around RAM access, because it seemed likely that TLB thrash was happening. I revised the program to run with huge pages (2MB instead of the standard 4KB), and observed a small speedup, about 2.5%. I found great information about setting up for huge pages here and here.

Best Answer

Today’s CPUs fetch memory in chunks of (typically) 64 bytes, called cache lines. When you read a particular memory location, the entire cache line is fetched from the main memory into the cache.

More here : http://igoro.com/archive/gallery-of-processor-cache-effects/