Memory/cache performance in working with arrays in C

c, c99, memory

I've been toying with some array examples in C. I want to understand more about memory concepts, alignment, and caching, especially for large arrays on the heap. Sometimes I work on extremely large images, so that's why it matters to me, but it's not essential to what I'm asking now.

I wrote the following example:

#include <stdio.h>
#include <stdlib.h>
#include <inttypes.h>
#include <GLFW/glfw3.h>

#define ARRAY_NUM (1000000 * 1000) // GIG
int main(int argc, char *argv[]) {

    if(!glfwInit()) {
        exit(EXIT_FAILURE);
    }

    uint64_t *array = malloc(sizeof(uint64_t) * ARRAY_NUM);
    if(array == NULL) { // bail out if the allocation failed
        exit(EXIT_FAILURE);
    }

    double time_start = glfwGetTime();
    for(uint64_t i = 0; i < ARRAY_NUM; ++i) {
        array[i] = 0xff;
    }
    double time_end = glfwGetTime();

    free(array);
    glfwTerminate();

    double performance = ((ARRAY_NUM * sizeof(uint64_t))/1000000) / (time_end - time_start);
    printf("Done in %f\n", (time_end - time_start));
    printf("Performance: %f MB/s\n", performance);

    exit(EXIT_SUCCESS);
}

With 1 gig allocated I get:

Done in 36.126213
Performance: 221.445854 MB/s

300 meg I get:

Done in 7.564931
Performance: 317.253391 MB/s

200 meg I get:

Done in 1.279391
Performance: 1250.594728 MB/s

100 meg I get:

Done in 0.355763
Performance: 2248.685313 MB/s

10 meg I get:

Done in 0.036575
Performance: 2187.315761 MB/s

and 1 meg I get:

Done in 0.004050
Performance: 1975.160383 MB/s

Of course, timings vary ~5-10% between runs.

Now, what I'm wondering is what can and should be done to get consistent, maximum throughput when traversing arrays on the heap – especially large ones. Something tells me that even the highest numbers of 2+ GB/s aren't the ceiling here, and the 1 meg result vs the 10 and 100 meg ones also interests me – it seems the cache only gets hot after a certain amount of work.

I am using C99 (gcc 5) and my understanding is that with malloc and uint64_t I ought to get suitably aligned memory already. If not, is there a sure way to get it? If alignment is even the culprit, of course. Maybe I'm hitting cache lines wrong, or touching the OS's VM pages in a way it doesn't like? Where would one even start in optimizing linear access within a large array?
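
For reference, if alignment did turn out to matter, my understanding is that posix_memalign (or C11's aligned_alloc) guarantees an explicitly aligned block. A minimal sketch, untested as to whether it changes anything here:

/* Ask for a 64-byte (cache-line) aligned block instead of plain malloc.
 * posix_memalign is POSIX; C11 also offers aligned_alloc(64, size). */
uint64_t *array = NULL;
if (posix_memalign((void **)&array, 64, sizeof(uint64_t) * ARRAY_NUM) != 0) {
    exit(EXIT_FAILURE);
}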

Details:

  • I've done this on a 2011 MacBookAir4,2 (i5-2557M) with 4 GB RAM and an SSD (maybe that's the problem – a low amount of RAM and malloc couldn't get the block in one go? I have to test this on my workstation when I get to it)
  • gcc 5.2.0 (Homebrew) with flags: -pedantic -std=c99 -Wall -Werror -Wextra -Wno-unused -O9 (yeah, niner… I use -O3 otherwise, but you never know), plus additional include, library, and framework flags to use the glfw timer I tend to use. I could've done it without; it doesn't matter.

EDIT:

After some thoughts Erik planted in my head, I made another test, showing some of the effects related to pre-fetching and, hopefully, the cache as well.

#include <stdio.h>
#include <stdlib.h>
#include <string.h> // memcpy
#include <inttypes.h>
#include <GLFW/glfw3.h>

#define ARRAY_NUM (1000000 * 125) // GIG
int main(int argc, char *argv[]) {

    double performance = 0.0;

    if(!glfwInit()) {
        exit(EXIT_FAILURE);
    }

    uint64_t *array = malloc(sizeof(uint64_t) * ARRAY_NUM);
    uint64_t *array_copy = malloc(sizeof(uint64_t) * ARRAY_NUM);
    if(array == NULL || array_copy == NULL) { // bail out if either allocation failed
        exit(EXIT_FAILURE);
    }

    double time_start = glfwGetTime();
    for(uint64_t i = 0; i < ARRAY_NUM; ++i) {
        array[i] = 0xff;
    }
    double time_end = glfwGetTime();

    performance = ((ARRAY_NUM * sizeof(uint64_t))/1000000) / (time_end - time_start);
    printf("Init done in %f - size of array: %lu MBs (x2)\n", (time_end - time_start), (ARRAY_NUM*sizeof(uint64_t)/1000000));
    printf("Performance: %f MB/s\n\n", performance);

    performance = 0;
    time_start = glfwGetTime();
    for(uint64_t i = 0; i < ARRAY_NUM; ++i) {
        array_copy[i] = array[i];
    }
    time_end = glfwGetTime();

    performance = ((ARRAY_NUM * sizeof(uint64_t))/1000000) / (time_end - time_start);
    printf("Copying (linear) done in %f\n", (time_end - time_start));
    printf("Performance: %f MB/s\n\n", performance);

    performance = 0;
    time_start = glfwGetTime();
    for(uint64_t i = 0; i < ARRAY_NUM; i=i+8) {
        array_copy[i] = array[i];
        if(i + 7 < ARRAY_NUM) {
            array_copy[i+1] = array[i+1];
            array_copy[i+2] = array[i+2];
            array_copy[i+3] = array[i+3];
            array_copy[i+4] = array[i+4];
            array_copy[i+5] = array[i+5];
            array_copy[i+6] = array[i+6];
            array_copy[i+7] = array[i+7];
        }
    }
    time_end = glfwGetTime();

    performance = ((ARRAY_NUM * sizeof(uint64_t))/1000000) / (time_end - time_start);
    printf("Copying (stride 8) done in %f\n", (time_end - time_start));
    printf("Performance: %f MB/s\n\n", performance);



    const int imax = 100;
    double performance_average = 0.0;
    for(int j = 0; j < imax; ++j) {
        uint64_t tmp = 0;
        performance = 0;
        time_start = glfwGetTime();
        for(uint64_t i = 0; i < ARRAY_NUM; i=i+8) {
            tmp = array[i];
            if(i + 7 < ARRAY_NUM) {
                tmp = array[i+1];
                tmp = array[i+2];
                tmp = array[i+3];
                tmp = array[i+4];
                tmp = array[i+5];
                tmp = array[i+6];
                tmp = array[i+7];
            }
        }
        time_end = glfwGetTime();

        performance = ((ARRAY_NUM * sizeof(uint64_t))/1000000) / (time_end - time_start);
        performance_average += performance;
        printf("[%d/%d] Performance stride 8: %f MB/s\r", j+1, imax, performance);
        fflush(stdout);
    }
    performance_average = performance_average / imax;
    printf("\nAverage: %f MB/s\n\n", performance_average);

    performance_average = 0.0;
    for(int j = 0; j < imax; ++j) {
        uint64_t tmp = 0;
        performance = 0;
        time_start = glfwGetTime();
        for(uint64_t i = 0; i < ARRAY_NUM; ++i) {
            tmp = array[i];
        }
        time_end = glfwGetTime();

        performance = ((ARRAY_NUM * sizeof(uint64_t))/1000000) / (time_end - time_start);
        performance_average += performance;
        printf("[%d/%d] Performance dumb: %f MB/s\r", j+1, imax, performance);
        fflush(stdout);
    }
    performance_average = performance_average / imax;
    printf("\nAverage: %f MB/s\n\n", performance_average);

    performance = 0;
    time_start = glfwGetTime();
    memcpy(array_copy, array, ARRAY_NUM*sizeof(uint64_t));
    time_end = glfwGetTime();

    performance = ((ARRAY_NUM * sizeof(uint64_t))/1000000) / (time_end - time_start);
    printf("Copying (memcpy) done in %f\n", (time_end - time_start));
    printf("Performance: %f MB/s\n", performance);

    free(array);
    free(array_copy);
    glfwTerminate();

    exit(EXIT_SUCCESS);
}

If you don't want to use glfw's timer, just replace it with your own. I compile it with -pedantic -std=c99 -Wall -Werror -Wextra -Wno-unused -O0; -fprefetch-loop-arrays doesn't seem to have any effect, so I left it out.
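
For example, a drop-in helper along these lines should work on POSIX systems (the name now_seconds is mine; older OS X lacks clock_gettime and would need mach_absolute_time instead):

#include <time.h>

/* Seconds from a monotonic clock as a double, same shape as glfwGetTime(). */
static double now_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}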

Now I can see the effects of pre-fetching in play. Here's a sample output:

Init done in 0.784799 - size of array: 1000 MBs (x2)
Performance: 1274.211087 MB/s

Copying (linear) done in 2.086404
Performance: 479.293545 MB/s

Copying (stride 8) done in 0.313592
Performance: 3188.856625 MB/s

[100/100] Performance stride 8: 6458.897164 MB/s
Average: 6393.163000 MB/s

[100/100] Performance dumb: 2597.816831 MB/s
Average: 2530.225830 MB/s

Copying (memcpy) done in 0.202581
Performance: 4936.303056 MB/s

You can see how reading 8 values per loop iteration (an arbitrary value for now) yields more than twice the bandwidth! What surprised me was that copying saw a 6x+ increase. I have included memcpy as a reference, which is faster still. Interesting stuff. I would like to hear more about what to look for and how to approach optimal reads/writes on arrays and possibly non-trivial payloads (I chose uint64_t in this case).

EDIT 2:

With a stride of 40 (no more, no less – 320 bytes) I got reads of ~7800 MB/s, which is closer to the theoretical maximum of 10600 MB/s (1333 MHz DDR3).

Best Answer

Your tests are basically looping straight through memory, so the cache is working for you roughly as follows:

You touch some memory that is not in any cache, which triggers a cache-line miss in the closest cache, L1; the miss cascades through the L2 and L3 caches until main memory is reached. An L3-cache-line-sized chunk of data is fetched from main memory to fill the L3, then an L2-cache-line-sized chunk is fetched from L3 to fill the L2, and likewise L1 is filled from L2.

Your program can now resume processing of the memory location that triggered the miss. The next few iterations of your loop will proceed with data from the L1 cache, because the cache line is larger than the 8 bytes you touch in each iteration. After a small number of iterations the end of the cache line is reached, which triggers another cache miss. That miss may or may not go all the way out to main memory again, depending on whether the L2 or L3 cache line size is the same as or larger than the L1 line size.

Your loop generally only benefits from the cache to the extent of the cache line size: the hardware fetches more data than a single iteration asks for.
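
For example, with 64-byte cache lines (typical for this class of CPU) and 8-byte uint64_t elements, only about one iteration in eight pays for a miss; the other seven are served straight from L1.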

Some processors (e.g. Itanium) have a mechanism for software to tell the hardware to perform cache line loads in advance. This can hide some of the cost of the main memory loads.
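
On x86 the same idea is available through prefetch hint instructions, which gcc exposes as __builtin_prefetch. A minimal sketch reusing your array and tmp (the 64-element lookahead is a guess that would need tuning, and the hardware prefetcher often already does this for a linear scan):

for (uint64_t i = 0; i < ARRAY_NUM; ++i) {
    /* hint: start loading data 64 elements (8 cache lines at 64 bytes) ahead */
    if (i + 64 < ARRAY_NUM)
        __builtin_prefetch(&array[i + 64], 0, 0); /* 0 = read, 0 = low temporal locality */
    tmp = array[i];
}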

Since your loop never revisits any of the memory it touches, you don't get any further benefit from the cache. If you wanted to experiment further, you could consider:

Write an outer loop that steps by some fixed stepping amount, plus inner loops that use the data in that stepping range repeatedly: a triply nested loop with the outer loop iterating by the fixed stepping amount, the first inner loop providing simple repetition (e.g. 100 iterations), and the innermost loop accessing the range of the array covered by the stepping amount. Essentially we access each memory location in the array 100 times, but change the order and grouping of how we access those locations by varying the stepping size in different runs. This should show substantially more variation (orders of magnitude) between when the caches are working well and when they are not.

With a small stepping size the caches should work well and the 100 iterations over each small range will benefit greatly from cache reuse; with large stepping sizes performance should fall off dramatically, closer to what you're observing with your straight run through memory. See the sketch below.
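
A rough sketch of that experiment, reusing array and ARRAY_NUM from your program (block size and repeat count are arbitrary and meant to be varied between runs):

#define BLOCK_ELEMS 4096 /* stepping amount in elements -- vary this between runs */
#define REPEATS 100      /* how many times each block is revisited */

uint64_t sum = 0;        /* accumulate so the reads aren't optimized away */
for (uint64_t base = 0; base < ARRAY_NUM; base += BLOCK_ELEMS) {
    uint64_t limit = base + BLOCK_ELEMS;
    if (limit > ARRAY_NUM) limit = ARRAY_NUM;
    for (int rep = 0; rep < REPEATS; ++rep) {      /* revisit the same block */
        for (uint64_t i = base; i < limit; ++i) {  /* walk the block linearly */
            sum += array[i];
        }
    }
}
printf("checksum: %" PRIu64 "\n", sum);            /* keep the compiler honest */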

(All this assumes, of course, that the compiler doesn't mess with your code too much; sometimes overly simple benchmarks get optimized away entirely...)
