C++ – Why Accessing Elements of a Huge Dynamically Allocated Structure is Slower

I am doing C++ programming in Ubuntu and really care about the efficiency of my code. The computer that I work with has 32 GB of RAM and the compilation is done with the C++11 option. I noticed that for a very large dynamically allocated array such as my_array_1 in the following code, accessing elements occur significantly slower than a [relatively] small arrays such as my_array_2. I tested that with structures, but I would suspect this is true for any type of large variables (??). See this code as an example:

#define NT 100000

typedef struct {
  float ind_1[4096];
  float ind_2[4096];
  int n;
} ind_vec; // 32 KB

// .....

ind_vec *my_array_1; // a huge struct
int *my_array_2; // a small vector

my_array_1 = new ind_vec[NT]; // about 3 GB
my_array_2 = new int[NT]; // about 400 KB

for(int i = 0; i<100; i++){ // This loop is slow!
  // I don't involve ind_1 and ind_2 for now
  my_array_1[i].n = 1; 
}

for(int j = 0; j<100; j++){ // This loop is fast!
  my_array_2[j] = 1;
}

delete[] my_array_1;
delete[] my_array_2;

As I indicated in the code, the 1st loop is much slower than the 2nd one (on the order of 1000 times). Exact timing of each loop was done through a simple function using gettimeofday (not shown here)

Q1) I'd assume that these two loops both effectively do the same job (form the computer's perspective) through the same approach. Is the performance difference because my_array_2 is allocated on heap while my_array_1 is perhaps allocated somewhere else (I don't know where)?

Q2) Is there any workaround here?

Best Answer

You have some very good answers on the topic here

Generally, your struct is probably too big for the CPU cache, so probably parts of it end up in L2 cache or in RAM memory, which is significantly slower than L1 cache, hence performance issues. You might want to try and do some profiling and find out exactly what is going on. If you do, I would very much like to read the results.

If you are striving for performance, ask yourself why do you need a struct with two arrays of that size? Could you achieve similar performance if you just had int pointers, and then allocate the arrays dynamically as needed? Could you simply lose the struct and handle the members independently? I know the last approach is very ugly, but when performance is the ultimate request, some sacrifices in code readability must be made.