Memory Alignment – Importance and Relevance Today

Tags: c, memory-usage, optimization, speed

For some time now, I have been searching and reading a lot about memory alignment: how it works and how to use it. The most relevant article I have found so far is this one.

But even with that I still have some questions about it:

  1. Outside of embedded systems, we often have huge amounts of memory in our computers, which makes memory management a lot less critical. I am completely into optimization, but is it really something that can make a difference if we compare the same program with and without its memory rearranged and aligned?
  2. Does memory alignment have other advantages? I read somewhere that the CPU works better/faster with aligned memory because it takes fewer instructions to process (does one of you have a link to an article/benchmark about it?). In that case, is the difference really significant? Are there more advantages than these two?
  3. In the linked article, at chapter 5, the author says (I try to illustrate my reading of this in the sketch after the list):

    Beware: in C++, classes that look like structs may break this rule! (Whether they do or not depends on how base classes and virtual member functions are implemented, and varies by compiler.)

  4. The article talks mostly about structures, but are local variable declarations also affected by this need?

    Do you have any idea of how memory alignment works exactly in C++, since it seems to have some differences?
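
To make point 3 concrete, here is a minimal sketch of what I understand could happen (the names and exact sizes are mine; the layout varies by compiler and ABI):

#include <cstdio>

struct Plain {             // plain struct: no hidden members
    char c;                // 1 byte + 3 bytes padding
    int  i;                // 4 bytes -> sizeof is 8 on typical targets
};

struct LooksLikeAStruct {  // same data members plus a virtual function
    char c;
    int  i;
    virtual void f() {}    // forces a hidden vtable pointer into the object
};

int main() {
    std::printf("Plain:            size %zu, align %zu\n",
                sizeof(Plain), alignof(Plain));
    std::printf("LooksLikeAStruct: size %zu, align %zu\n",
                sizeof(LooksLikeAStruct), alignof(LooksLikeAStruct));
}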

This earlier question contains the word "alignment", but it does not provide any answers to the questions above.

Best Answer

Yes, both the alignment and the arrangement of your data can make a big difference in performance: not just a few percent, but from a few percent to many hundreds of percent.

Take this loop; only two of its instructions matter if you run enough iterations:

.globl ASMDELAY
ASMDELAY:
    subs r0,r0,#1      @ decrement the loop counter
    bne ASMDELAY       @ branch back until it reaches zero
    bx lr              @ return to the caller

Run it with and without cache, at varying alignments, toss in branch prediction, and you can vary those two instructions' performance by a significant amount (timer ticks, in hex):

min      max      difference
00016DDE 003E025D 003C947F

That maximum is over 40 times the minimum, from the same two instructions. This is a performance test you can very easily do yourself: add or remove nops around the code under test, do an accurate job of timing, move the instructions under test across a wide enough range of addresses to touch the edges of cache lines, and so on.
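
A minimal sketch of such a harness (GCC-style inline assembly assumed; the nop count and iteration count are arbitrary knobs to vary between runs):

#include <chrono>
#include <cstdio>

// Each extra nop shifts the loop below it to a different address,
// moving it relative to cache-line and fetch boundaries.
__attribute__((noinline)) static void delay_loop(unsigned long n) {
    asm volatile("nop\n\tnop\n\tnop");  // vary this padding between runs
    while (n--)
        asm volatile("");               // keep the loop from being optimized out
}

int main() {
    auto t0 = std::chrono::steady_clock::now();
    delay_loop(100000000UL);            // arbitrary iteration count
    auto t1 = std::chrono::steady_clock::now();
    std::printf("%lld us\n", static_cast<long long>(
        std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count()));
}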

The same kind of thing happens with data accesses. Some architectures complain about unaligned accesses (performing a 32-bit read at address 0x1001, for example) by giving you a data fault. On some of those you can disable the fault and take the performance hit instead. On others that allow unaligned accesses, you simply get the performance hit.
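
In C or C++ terms, the access being described looks like this (a sketch; whether the commented-out cast actually faults depends on the architecture):

#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    alignas(4) unsigned char buf[8] = {0, 1, 2, 3, 4, 5, 6, 7};

    // A 32-bit read at buf + 1 is the unaligned case described above.
    // The cast form may fault on strict architectures (and is undefined
    // behavior in C++ regardless of the hardware):
    // uint32_t v = *reinterpret_cast<uint32_t *>(buf + 1);

    // The portable form: let the compiler emit whatever access the
    // target actually supports, and pay only the performance cost.
    uint32_t v;
    std::memcpy(&v, buf + 1, sizeof v);
    std::printf("%08x\n", v);
}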

It is sometimes "instructions" but most of the time it is clock/bus cycles.

Look at the memcpy implementations in gcc for various targets. Say you are copying a structure that is 0x43 bytes: you may find an implementation that copies one byte (leaving 0x42), then copies 0x40 bytes in large, efficient chunks, then handles the last 0x2 either as two individual bytes or as one 16-bit transfer. Alignment of source and destination comes into play: if both addresses share the same alignment, say 0x1003 and 0x2003, then you can do the one byte, then 0x40 in big chunks, then 0x2; but if one is 0x1002 and the other 0x1003, it gets real ugly and real slow.
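
Roughly, the shape of such an implementation is the following (a heavily simplified sketch; the word-sized stores mimic what the real assembly does and assume source and destination share the same alignment, as in the 0x1003/0x2003 case):

#include <cstddef>
#include <cstdint>

// Simplified alignment-aware copy: head bytes until aligned, then
// large chunks, then tail bytes.
void *copy_like_memcpy(void *dst, const void *src, size_t n) {
    unsigned char *d = static_cast<unsigned char *>(dst);
    const unsigned char *s = static_cast<const unsigned char *>(src);

    // Head: single bytes until the destination is 4-byte aligned.
    while (n != 0 && (reinterpret_cast<uintptr_t>(d) & 3u) != 0) {
        *d++ = *s++;
        --n;
    }
    // Middle: efficient 32-bit word copies (real versions use far
    // wider chunks and target-specific instructions).
    while (n >= 4) {
        *reinterpret_cast<uint32_t *>(d) =
            *reinterpret_cast<const uint32_t *>(s);
        d += 4; s += 4; n -= 4;
    }
    // Tail: whatever bytes remain.
    while (n-- != 0)
        *d++ = *s++;
    return dst;
}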

Most of the time it is bus cycles, or worse, the number of transfers. Take a processor with a 64-bit-wide data bus, like many ARM cores, and do a four-word transfer (read or write, LDM or STM) at address 0x1004. That is a word-aligned address and perfectly legal, but if the bus is 64 bits wide the single instruction will likely turn into three transfers: a 32-bit at 0x1004, a 64-bit at 0x1008, and a 32-bit at 0x1010. Had you issued the same instruction at address 0x1008, it could do a single four-word transfer. Each transfer has a setup time associated with it, so the difference between addresses 0x1004 and 0x1008 by itself can make the access several times faster, even (or especially) when using a cache and all accesses are cache hits.
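
From C++ you can put such a buffer on the friendly boundary up front; a tiny sketch (the 8-byte figure matches the 64-bit bus in the example):

#include <cstdint>
#include <cstdio>

// Align the buffer to the bus width so a four-word transfer starting
// at its base behaves like the 0x1008 case, never the 0x1004 case.
alignas(8) static std::uint32_t buffer[4];

int main() {
    std::printf("buffer starts at %p (8-byte aligned)\n",
                static_cast<void *>(buffer));
}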

Speaking of which, even a two-word read at address 0x1000 versus 0x0FFC behaves differently: with cache misses, the 0x0FFC access causes two cache-line reads where 0x1000 needs only one. You pay the penalty of a cache-line read anyway for a random access (reading more data than you use), but here it doubles. How your structures are aligned, how your data is laid out in general, and how frequently you access that data can all cause cache thrashing.
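
At the source level this is exactly what alignas is for (a sketch; the 64-byte line size is an assumption about the target):

#include <cstddef>

// Assuming 64-byte cache lines: start the block on a line boundary so
// none of its fields straddle two lines, avoiding the doubled read.
struct alignas(64) Block {
    double values[8];    // 64 bytes: exactly one line on the assumed target
};

static_assert(alignof(Block) == 64, "Block starts on a cache-line boundary");
static_assert(sizeof(Block) % 64 == 0, "Blocks tile cache lines in an array");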

You can end up striping your data such that, as you process it, you create evictions: get really unlucky and you end up using only a fraction of your cache, and as you jump through the data the next blob collides with a prior one. By mixing up your data, or rearranging functions in the source code, you can create or remove collisions. Since not all caches are created equal, the compiler isn't going to help you here; it is on you. Even detecting the performance hit or improvement is on you.

All the things we have added to improve performance (wider data buses, pipelines, caches, branch prediction, multiple execution units/paths, and so on) will most often help, but they all have weak spots that can be exploited either intentionally or accidentally. There is very little the compiler or libraries can do about it. If you are interested in performance, you need to tune, and one of the biggest tuning factors is the alignment of the code and the data: not just alignment on 32-, 64-, 128-, or 256-bit boundaries, but also where things sit relative to each other. You want heavily used loops or re-used data not to land in the same cache way; they each want their own. Compilers can help, for example with the ordering of instructions for a superscalar architecture: rearranging instructions whose relative order does not matter can make a big performance gain, or a hit if you are not efficiently using the execution paths, but you have to tell the compiler what you are running on.

The biggest oversight is the assumption that the processor is the bottleneck. That has not been true for a decade or more: feeding the processor is the problem, and that is where issues like alignment performance hits and cache thrashing come into play. With a little work, even at the source-code level (rearranging data within a structure, ordering variable/struct declarations, ordering functions within the source file, plus a little extra code to align data), you can improve performance several times over or more.
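
The struct-rearrangement part is the easiest to see (the sizes in the comments assume a typical LP64 ABI):

#include <cstdio>

struct Loose {        // members ordered carelessly
    char   a;         // 1 byte + 7 bytes padding (double needs 8-byte alignment)
    double d;         // 8 bytes
    char   b;         // 1 byte + 3 bytes padding
    int    i;         // 4 bytes -> 24 bytes total
};

struct Packed {       // same members, largest first
    double d;         // 8 bytes
    int    i;         // 4 bytes
    char   a, b;      // 2 bytes + 2 bytes padding -> 16 bytes total
};

int main() {
    std::printf("Loose:  %zu bytes\nPacked: %zu bytes\n",
                sizeof(Loose), sizeof(Packed));
}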
