Memory/cache performance in working with arrays in C

c, c99, memory

I've been toying with some array examples in C. I want to understand more about memory concepts, alignment, and caching, especially for large arrays on the heap. Sometimes I work on extremely large images, so that's why it matters to me, but it's not essential to what I'm asking now.

I wrote the following example:

#include <stdio.h>
#include <stdlib.h>
#include <inttypes.h>
#include <GLFW/glfw3.h>

#define ARRAY_NUM (1000000 * 1000) // GIG
int main(int argc, char *argv[]) {

    if(!glfwInit()) {
        exit(EXIT_FAILURE);
    }

    uint64_t *array = malloc(sizeof(uint64_t) * ARRAY_NUM);
    if(array == NULL) { // bail out if the allocation failed
        exit(EXIT_FAILURE);
    }

    double time_start = glfwGetTime();
    for(uint64_t i = 0; i < ARRAY_NUM; ++i) {
        array[i] = 0xff;
    }
    double time_end = glfwGetTime();

    free(array);
    glfwTerminate();

    double performance = ((ARRAY_NUM * sizeof(uint64_t))/1000000) / (time_end - time_start);
    printf("Done in %f\n", (time_end - time_start));
    printf("Performance: %f MB/s\n", performance);

    exit(EXIT_SUCCESS);
}

With 1 gig allocated I get:

Done in 36.126213
Performance: 221.445854 MB/s

300 meg I get:

Done in 7.564931
Performance: 317.253391 MB/s

200 meg I get:

Done in 1.279391
Performance: 1250.594728 MB/s

100 meg I get:

Done in 0.355763
Performance: 2248.685313 MB/s

10 meg I get:

Done in 0.036575
Performance: 2187.315761 MB/s

and 1 meg I get:

Done in 0.004050
Performance: 1975.160383 MB/s

Of course, timings vary ~5-10% between runs.

Now, what I'm wondering is what can and should be done to get consistent, maximum throughput when traversing arrays on the heap – especially large ones. Something tells me that even the highest numbers of 2+ GB/s aren't the ceiling here, and the 1 meg result vs the 10 and 100 meg ones also interests me – it seems the cache only gets hot after a certain amount of work.

I am using C99 (gcc 5) and my understanding is that with malloc and uint64_t I ought to get suitably aligned memory already. If not, is there a sure way to get it? If alignment is even the culprit, of course. Maybe I'm hitting cache lines wrong, or touching the OS's VM pages in a way it doesn't like? Where would one even start in optimizing linear access within a large array?
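
For reference, if alignment did turn out to matter, my understanding is that posix_memalign (or C11's aligned_alloc) guarantees an explicitly aligned block. A minimal sketch, untested as to whether it changes anything here:

/* Ask for a 64-byte (cache-line) aligned block instead of plain malloc.
 * posix_memalign is POSIX; C11 also offers aligned_alloc(64, size). */
uint64_t *array = NULL;
if (posix_memalign((void **)&array, 64, sizeof(uint64_t) * ARRAY_NUM) != 0) {
    exit(EXIT_FAILURE);
}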

Details:

  • I've done this on a 2011 MacBookAir4,2 (i5-2557M) with 4 GB RAM and an SSD (maybe that's the problem – a low amount of RAM and malloc couldn't get the block in one go? I have to test this on my workstation when I get to it)
  • gcc 5.2.0 (Homebrew) with flags: -pedantic -std=c99 -Wall -Werror -Wextra -Wno-unused -O9 (yeah, niner… I use -O3 otherwise, but you never know), plus additional include, library, and framework flags to use the glfw timer I tend to use. I could've done it without; it doesn't matter.

EDIT:

After some thoughts Erik planted in my head, I made another test, showing some of the effects related to pre-fetching and, hopefully, the cache as well.

#include <stdio.h>
#include <stdlib.h>
#include <string.h> // memcpy
#include <inttypes.h>
#include <GLFW/glfw3.h>

#define ARRAY_NUM (1000000 * 125) // GIG
int main(int argc, char *argv[]) {

    double performance = 0.0;

    if(!glfwInit()) {
        exit(EXIT_FAILURE);
    }

    uint64_t *array = malloc(sizeof(uint64_t) * ARRAY_NUM);
    uint64_t *array_copy = malloc(sizeof(uint64_t) * ARRAY_NUM);
    if(array == NULL || array_copy == NULL) { // bail out if either allocation failed
        exit(EXIT_FAILURE);
    }

    double time_start = glfwGetTime();
    for(uint64_t i = 0; i < ARRAY_NUM; ++i) {
        array[i] = 0xff;
    }
    double time_end = glfwGetTime();

    performance = ((ARRAY_NUM * sizeof(uint64_t))/1000000) / (time_end - time_start);
    printf("Init done in %f - size of array: %lu MBs (x2)\n", (time_end - time_start), (ARRAY_NUM*sizeof(uint64_t)/1000000));
    printf("Performance: %f MB/s\n\n", performance);

    performance = 0;
    time_start = glfwGetTime();
    for(uint64_t i = 0; i < ARRAY_NUM; ++i) {
        array_copy[i] = array[i];
    }
    time_end = glfwGetTime();

    performance = ((ARRAY_NUM * sizeof(uint64_t))/1000000) / (time_end - time_start);
    printf("Copying (linear) done in %f\n", (time_end - time_start));
    printf("Performance: %f MB/s\n\n", performance);

    performance = 0;
    time_start = glfwGetTime();
    for(uint64_t i = 0; i < ARRAY_NUM; i=i+8) {
        array_copy[i] = array[i];
        if(i + 7 < ARRAY_NUM) {
            array_copy[i+1] = array[i+1];
            array_copy[i+2] = array[i+2];
            array_copy[i+3] = array[i+3];
            array_copy[i+4] = array[i+4];
            array_copy[i+5] = array[i+5];
            array_copy[i+6] = array[i+6];
            array_copy[i+7] = array[i+7];
        }
    }
    time_end = glfwGetTime();

    performance = ((ARRAY_NUM * sizeof(uint64_t))/1000000) / (time_end - time_start);
    printf("Copying (stride 8) done in %f\n", (time_end - time_start));
    printf("Performance: %f MB/s\n\n", performance);



    const int imax = 100;
    double performance_average = 0.0;
    for(int j = 0; j < imax; ++j) {
        uint64_t tmp = 0;
        performance = 0;
        time_start = glfwGetTime();
        for(uint64_t i = 0; i < ARRAY_NUM; i=i+8) {
            tmp = array[i];
            if(i + 7 < ARRAY_NUM) {
                tmp = array[i+1];
                tmp = array[i+2];
                tmp = array[i+3];
                tmp = array[i+4];
                tmp = array[i+5];
                tmp = array[i+6];
                tmp = array[i+7];
            }
        }
        time_end = glfwGetTime();

        performance = ((ARRAY_NUM * sizeof(uint64_t))/1000000) / (time_end - time_start);
        performance_average += performance;
        printf("[%d/%d] Performance stride 8: %f MB/s\r", j+1, imax, performance);
        fflush(stdout);
    }
    performance_average = performance_average / imax;
    printf("\nAverage: %f MB/s\n\n", performance_average);

    performance_average = 0.0;
    for(int j = 0; j < imax; ++j) {
        uint64_t tmp = 0;
        performance = 0;
        time_start = glfwGetTime();
        for(uint64_t i = 0; i < ARRAY_NUM; ++i) {
            tmp = array[i];
        }
        time_end = glfwGetTime();

        performance = ((ARRAY_NUM * sizeof(uint64_t))/1000000) / (time_end - time_start);
        performance_average += performance;
        printf("[%d/%d] Performance dumb: %f MB/s\r", j+1, imax, performance);
        fflush(stdout);
    }
    performance_average = performance_average / imax;
    printf("\nAverage: %f MB/s\n\n", performance_average);

    performance = 0;
    time_start = glfwGetTime();
    memcpy(array_copy, array, ARRAY_NUM*sizeof(uint64_t));
    time_end = glfwGetTime();

    performance = ((ARRAY_NUM * sizeof(uint64_t))/1000000) / (time_end - time_start);
    printf("Copying (memcpy) done in %f\n", (time_end - time_start));
    printf("Performance: %f MB/s\n", performance);

    free(array);
    free(array_copy);
    glfwTerminate();

    exit(EXIT_SUCCESS);
}

If you don't want to use glfw's timer, just replace it with your own. I compile it with -pedantic -std=c99 -Wall -Werror -Wextra -Wno-unused -O0; -fprefetch-loop-arrays doesn't seem to have any effect, so I left it out.
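
For example, a drop-in helper along these lines should work on POSIX systems (the name now_seconds is mine; older OS X lacks clock_gettime and would need mach_absolute_time instead):

#include <time.h>

/* Seconds from a monotonic clock as a double, same shape as glfwGetTime(). */
static double now_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}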

Now I can see the effects of pre-fetching in play. Here's a sample output:

Init done in 0.784799 - size of array: 1000 MBs (x2)
Performance: 1274.211087 MB/s

Copying (linear) done in 2.086404
Performance: 479.293545 MB/s

Copying (stride 8) done in 0.313592
Performance: 3188.856625 MB/s

[100/100] Performance stride 8: 6458.897164 MB/s
Average: 6393.163000 MB/s

[100/100] Performance dumb: 2597.816831 MB/s
Average: 2530.225830 MB/s

Copying (memcpy) done in 0.202581
Performance: 4936.303056 MB/s

You can see how reading 8 values per loop iteration (an arbitrary value for now) yields more than twice the bandwidth! What surprised me was that copying saw a 6x+ increase. I have included memcpy as a reference, which is faster still. Interesting stuff. I would like to hear more about what to look for and how to approach optimal reads/writes on arrays and possibly non-trivial payloads (I chose uint64_t in this case).

EDIT 2:

With a stride of 40 (no more, no less – 320 bytes) I got reads of ~7800 MB/s, which is closer to the theoretical maximum of 10600 MB/s (1333 MHz DDR3).

Best Answer

Your tests are basically looping straight through memory, so the cache is working for you roughly as follows:

You touch some memory that is not in any cache, which triggers a cache-line miss in the closest cache, L1; the miss cascades through the L2 and L3 caches until main memory is reached. An L3-cache-line-sized chunk of data is fetched from main memory to fill the L3, then an L2-cache-line-sized chunk is fetched from L3 to fill the L2, and likewise L1 is filled from L2.

Your program can now resume processing of the memory location that triggered the miss. The next few iterations of your loop will proceed with data from the L1 cache, because the cache line is larger than the 8 bytes you touch in each iteration. After a small number of iterations the end of the cache line is reached, which triggers another cache miss. That miss may or may not go all the way out to main memory again, depending on whether the L2 or L3 cache line size is the same as or larger than the L1 line size.

Your loop generally only benefits from the cache to the extent of the cache line size: the hardware fetches more data than a single iteration asks for.
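
For example, with 64-byte cache lines (typical for this class of CPU) and 8-byte uint64_t elements, only about one iteration in eight pays for a miss; the other seven are served straight from L1.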

Some processors (e.g. Itanium) have a mechanism for software to tell the hardware to perform cache line loads in advance. This can hide some of the cost of the main memory loads.
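
On x86 the same idea is available through prefetch hint instructions, which gcc exposes as __builtin_prefetch. A minimal sketch reusing your array and tmp (the 64-element lookahead is a guess that would need tuning, and the hardware prefetcher often already does this for a linear scan):

for (uint64_t i = 0; i < ARRAY_NUM; ++i) {
    /* hint: start loading data 64 elements (8 cache lines at 64 bytes) ahead */
    if (i + 64 < ARRAY_NUM)
        __builtin_prefetch(&array[i + 64], 0, 0); /* 0 = read, 0 = low temporal locality */
    tmp = array[i];
}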

Since your loop never revisits any of the memory it touches, you don't get any further benefit from the cache. If you wanted to experiment further, you could consider:

Write an outer loop that steps by some fixed stepping amount, plus inner loops that use the data in that stepping range repeatedly: a triply nested loop with the outer loop iterating by the fixed stepping amount, the first inner loop providing simple repetition (e.g. 100 iterations), and the innermost loop accessing the range of the array covered by the stepping amount. Essentially we access each memory location in the array 100 times, but change the order and grouping of how we access those locations by varying the stepping size in different runs. This should show substantially more variation (orders of magnitude) between when the caches are working well and when they are not.

With a small stepping size the caches should work well and the 100 iterations over each small range will benefit greatly from cache reuse; with large stepping sizes performance should fall off dramatically, closer to what you're observing with your straight run through memory. See the sketch below.
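
A rough sketch of that experiment, reusing array and ARRAY_NUM from your program (block size and repeat count are arbitrary and meant to be varied between runs):

#define BLOCK_ELEMS 4096 /* stepping amount in elements -- vary this between runs */
#define REPEATS 100      /* how many times each block is revisited */

uint64_t sum = 0;        /* accumulate so the reads aren't optimized away */
for (uint64_t base = 0; base < ARRAY_NUM; base += BLOCK_ELEMS) {
    uint64_t limit = base + BLOCK_ELEMS;
    if (limit > ARRAY_NUM) limit = ARRAY_NUM;
    for (int rep = 0; rep < REPEATS; ++rep) {      /* revisit the same block */
        for (uint64_t i = base; i < limit; ++i) {  /* walk the block linearly */
            sum += array[i];
        }
    }
}
printf("checksum: %" PRIu64 "\n", sum);            /* keep the compiler honest */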

(All this assumes, of course, that the compiler doesn't mess with your code too much; sometimes overly simple benchmarks get optimized away entirely...)
