OK, the situation is more confusing than one would like. So, in an effort to clarify, here are the findings on this journey :
accessing unaligned memory
- The only portable, standard-C solution for accessing unaligned memory is the `memcpy` one. I was hoping to get another one through this question, but apparently it is the only one found so far.
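Example code:

    #include <string.h>  /* memcpy */

    /* u32 is assumed to be a 32-bit unsigned type, e.g. typedef uint32_t u32; */
    u32 read32(const void* ptr) {
        u32 value;
        memcpy(&value, ptr, sizeof(value));
        return value;
    }
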
This solution is safe in all circumstances. It also compiles into a trivial `load register` operation on x86 targets using GCC.
However, on ARM targets using GCC, it translates into a needlessly long assembly sequence, which bogs down performance.
Using Clang on ARM targets, `memcpy` works fine (see @notlikethat's comment below). It would be easy to blame GCC at large, but it's not that simple: the `memcpy` solution works fine on GCC with x86/x64, PPC and ARM64 targets. Lastly, trying another compiler, icc 13, the `memcpy` version is surprisingly heavier on x86/x64 (4 instructions, while one should be enough). And that's just the combinations I could test so far.
I have to thank the godbolt project for making such statements easy to observe.
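As a rough, self-contained illustration of the `memcpy` approach, the sketch below adds a matching store function and a small round-trip check. `write32` and `roundtrip_all_offsets` are names made up for this example, and `u32` is assumed to be `uint32_t`.

```c
#include <stdint.h>
#include <string.h>

typedef uint32_t u32;  /* assumption: u32 is a 32-bit unsigned integer */

/* Read a 32-bit value from any address, aligned or not. */
static u32 read32(const void *ptr) {
    u32 value;
    memcpy(&value, ptr, sizeof(value));
    return value;
}

/* Hypothetical counterpart: store a 32-bit value to any address. */
static void write32(void *ptr, u32 value) {
    memcpy(ptr, &value, sizeof(value));
}

/* Hypothetical helper: writes and reads back `value` at offsets 0..3 of a
 * byte buffer, so at least three of the four addresses are misaligned for
 * a 4-byte type. Returns 1 if every read matches, 0 otherwise. */
static int roundtrip_all_offsets(u32 value) {
    unsigned char buffer[sizeof(u32) + 3] = {0};
    for (size_t offset = 0; offset < 4; offset++) {
        write32(buffer + offset, value);
        if (read32(buffer + offset) != value) {
            return 0;
        }
    }
    return 1;
}
```

Because the value is written and read with the same host byte order, the check does not depend on endianness.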
- The second solution is to use `__packed` structures. This solution is not standard C and depends entirely on compiler extensions. As a consequence, the way to write it depends on the compiler, and sometimes on its version. This is a mess for maintenance of portable code.
That being said, in most circumstances, it leads to better code generation than `memcpy`. In most circumstances only...
For example, regarding the above cases where the `memcpy` solution does not work well, the table below records `__packed` as the better choice for icc 13 on x86/x64 and for GCC on ARMv7, but not for GCC on ARMv6.
- The last solution is direct access to the unaligned memory position. Unfortunately, in at least one case, it is the only solution able to extract performance from the target, namely GCC on ARMv6.
Do not use this solution for ARMv7 though: GCC can generate instructions which are reserved for aligned memory accesses, namely `LDM` (Load Multiple), leading to a crash.
Even on x86/x64, it becomes dangerous to write your code this way nowadays, as new-generation compilers may try to auto-vectorize some compatible loops, generating SSE/AVX code based on the assumption that these memory positions are properly aligned, crashing the program.
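The two non-standard alternatives, a packed structure and a direct access through a cast pointer, might look roughly like this with GCC or Clang. This is only a sketch: `unalign32`, `read32_packed` and `read32_direct` are names invented here, `u32` is assumed to be `uint32_t`, and `__attribute__((packed))` is the GCC/Clang spelling, so other compilers need a different syntax.

```c
#include <stdint.h>

typedef uint32_t u32;  /* assumption: u32 is a 32-bit unsigned integer */

/* Packed-structure variant. The packed attribute tells the compiler that
 * the member may sit at any address, so it must emit a load that is safe
 * for unaligned data. This is a compiler extension, not standard C. */
typedef struct { u32 v; } __attribute__((packed)) unalign32;

static u32 read32_packed(const void *ptr) {
    return ((const unalign32 *)ptr)->v;
}

/* Direct variant. The cast promises the compiler that ptr is suitably
 * aligned for u32; if that promise is false the behaviour is undefined,
 * which is what allows aligned-only instructions to be generated.
 * Only call this with addresses known to be aligned unless the target
 * and compiler have been checked. */
static u32 read32_direct(const void *ptr) {
    return *(const u32 *)ptr;
}
```

In this sketch the packed reader can be handed any byte address, while the direct reader is only well defined for aligned addresses, which is the risk described above.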
As a recap, here are the results summarized as a table. Each cell names the preferred solution for that compiler and target, using the order of preference memcpy > packed > direct.
| compiler | x86/x64 | ARMv7 | ARMv6 | ARM64 | PPC |
|-----------|---------|--------|--------|--------|--------|
| GCC 4.8 | memcpy | packed | direct | memcpy | memcpy |
| clang 3.6 | memcpy | memcpy | memcpy | memcpy | ? |
| icc 13 | packed | N/A | N/A | N/A | N/A |
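One way to turn the table into code is a compile-time switch. The sketch below is an assumption-laden illustration, not a tested configuration: `READ32_METHOD` is a hypothetical macro (0 = memcpy, 1 = packed, 2 = direct), only a couple of the predefined architecture macros are checked, and the packed branch assumes a compiler that accepts the GCC attribute syntax.

```c
#include <stdint.h>
#include <string.h>

typedef uint32_t u32;  /* assumption: u32 is a 32-bit unsigned integer */

/* READ32_METHOD is a hypothetical knob; it can be forced from the build
 * system (e.g. -DREAD32_METHOD=0) to override the guesses below. */
#ifndef READ32_METHOD
#  if defined(__GNUC__) && !defined(__clang__) && defined(__ARM_ARCH_6__)
#    define READ32_METHOD 2  /* GCC on ARMv6: direct access */
#  elif defined(__INTEL_COMPILER) || \
        (defined(__GNUC__) && !defined(__clang__) && defined(__ARM_ARCH_7A__))
#    define READ32_METHOD 1  /* icc on x86/x64, GCC on ARMv7: packed */
#  else
#    define READ32_METHOD 0  /* everything else: portable memcpy */
#  endif
#endif

#if READ32_METHOD == 2
/* Direct access: relies on the target tolerating unaligned loads. */
static u32 read32(const void *ptr) { return *(const u32 *)ptr; }
#elif READ32_METHOD == 1
/* Packed structure: compiler extension, GCC-style attribute syntax. */
typedef struct { u32 v; } __attribute__((packed)) unalign32;
static u32 read32(const void *ptr) { return ((const unalign32 *)ptr)->v; }
#else
/* Portable default. */
static u32 read32(const void *ptr) {
    u32 value;
    memcpy(&value, ptr, sizeof(value));
    return value;
}
#endif
```

Defaulting to `memcpy` keeps untested compiler and target combinations on the only standard-conforming path.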
Best Answer
It is primarily highlighting the contrast with the A32 (ARM) `LDRD`/`STRD` instructions*, which can only load a consecutive pair of registers, the lowest of which must be even-numbered (for instance r0/r1 or r2/r3, but not r1/r2 or r0/r2).
[This is down to the fact that there's only space to encode one target register in the instruction, so the architecture must have a defined way of inferring the second target register.]
In contrast, the A64 `LDP`/`STP` encodings have room to encode two target registers, which means they can be any two registers in any order, i.e. they are allowed to be non-contiguous - it's a permission, not a restriction.
Note that that particular document is obsolete since the release of the full ARMv8 ARM (Architecture Reference Manual), which has proper detailed instruction pages that should be slightly less ambiguous.
* The T32 (Thumb) encodings don't have this restriction, since the lack of a condition predicate means there's space to encode the second target register, much like A64.
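The difference between the two pairing rules can be restated as a pair of predicates. This is a toy model written in C purely for illustration: the function names are invented, only the conditions stated in this answer are modelled, and any further architectural constraints on specific registers are deliberately left out.

```c
#include <stdbool.h>

/* A32 LDRD/STRD: only the first register is encoded, so the second one is
 * inferred as the next register, and the first must be even-numbered.
 * Register numbers are assumed to be valid A32 core register numbers. */
static bool a32_ldrd_pair_allowed(unsigned rt, unsigned rt2) {
    return (rt % 2 == 0) && (rt2 == rt + 1);
}

/* A64 LDP/STP: both registers have their own 5-bit field in the encoding,
 * so any two register numbers that fit in those fields may be named, in
 * any order and with no requirement that they be adjacent. */
static bool a64_ldp_pair_allowed(unsigned rt, unsigned rt2) {
    return rt < 32 && rt2 < 32;
}
```

For example, the pair (0, 1) passes both checks, while (1, 2) and (7, 3) pass only the A64 one.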