Stp aarch64 instruction must be used with “non-contiguous pair of registers”

arm, arm64, assembly, cpu-registers

The AArch64 architecture doesn't have instructions for multiple store and load, i.e. there are no equivalents of stm and ldm from the ARMv7 architecture. Instead you must use the stp and ldp instructions for storing and loading pairs of registers.

According to the ARM reference manual:

http://www.element14.com/community/servlet/JiveServlet/previewBody/41836-102-1-229511/ARM.Reference_Manual.pdf

> There are no multiple register LDM, STM, PUSH and POP instructions, but load-store of a non-contiguous pair of registers is available.

My question is, what does non-contiguous mean or refer to here? My instant reaction was that it means you can't use consecutively numbered registers with these commands, e.g.

stp x0, x1, [sp, #-16]!

is illegal. However I don't believe this is the case. I've seen example code doing exactly this and furthermore I've managed to get (Apple's) Clang to generate similar code, e.g.

stp x1, x0, [fp, #-16]!

I can't for the life of me think what contiguous then means. I thought it could be something to do with using overlapping registers, e.g.

stp x0, x0, [sp, #-16]!
stp w0, x0, [sp, #-12]!

However I've seen example code doing this sort of thing as well (not to say that code was correct!). Also, I would have expected the term "overlapping" rather than "contiguous" if this were the case.

Any ideas?

Best Answer

It is primarily highlighting the contrast with the A32 (ARM) LDRD/STRD instructions*, which can only load a consecutive pair of registers, the lowest of which must be even-numbered, i.e.:

LDRD r0, r1, [sp]   @ OK
LDRD r0, r7, [sp]   @ <Rt> and <Rt2> are non-contiguous: invalid
LDRD r3, r4, [sp]   @ Contiguous but <Rt> odd-numbered: invalid

[This is down to the fact that there's only space to encode one target register in the instruction, so the architecture must have a defined way of inferring the second target register.]

In contrast, the A64 LDP/STP encodings have room to encode two target registers, which means they can be any two registers in any order, i.e. they are allowed to be non-contiguous - it's a permission, not a restriction.
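To make the encoding argument concrete, here is a minimal sketch in C. The function names `a32_ldrd_pair_ok` and `a64_stp_pre` are hypothetical helpers invented for illustration. The first models only the two A32 constraints named above (even `Rt`, `Rt2 = Rt + 1`); the second assembles the 64-bit pre-indexed form of STP, assuming the offset is a multiple of 8 in the range -512 to 504.

```c
#include <stdbool.h>
#include <stdint.h>

/* A32 LDRD/STRD: only Rt is encoded and Rt2 is implied as Rt + 1.
 * Models just the two constraints discussed above. */
bool a32_ldrd_pair_ok(unsigned rt, unsigned rt2)
{
    return (rt % 2 == 0) && (rt2 == rt + 1);
}

/* A64 STP, 64-bit, pre-indexed: stp Xt, Xt2, [Xn|SP, #imm]!
 * Rt (bits 4:0) and Rt2 (bits 14:10) are independent 5-bit fields,
 * so any two registers in any order can be encoded.
 * Assumes imm is a multiple of 8 in [-512, 504]. */
uint32_t a64_stp_pre(unsigned rt, unsigned rt2, unsigned rn, int imm)
{
    uint32_t imm7 = (uint32_t)(imm / 8) & 0x7Fu;  /* offset is scaled by 8 */
    return 0xA9800000u | (imm7 << 15) | ((rt2 & 31u) << 10)
                       | ((rn & 31u) << 5) | (rt & 31u);
}
```

For example, `a64_stp_pre(29, 30, 31, -16)` gives `0xA9BF7BFD`, which is the encoding of `stp x29, x30, [sp, #-16]!`, while `a32_ldrd_pair_ok(0, 7)` and `a32_ldrd_pair_ok(3, 4)` are both false, matching the two invalid LDRD lines above.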

Note that that particular document is obsolete since the release of the full ARMv8 ARM, which has proper detailed instruction pages that should be slightly less ambiguous.

\* The T32 (Thumb) encodings don't have this restriction, since the lack of a condition predicate means there's space to encode the second target register, much like A64.

accessing unaligned memory

The only portable, standard-C way to access unaligned memory is memcpy. I was hoping to get another one through this question, but apparently it's the only one found so far.

Example code :

#include <stdint.h>
#include <string.h>
typedef uint32_t u32;   /* assumed definition of u32 */
u32 read32(const void* ptr) {
    u32 value;
    memcpy(&value, ptr, sizeof(value));
    return value;
}

This solution is safe in all circumstances. It also compiles into a trivial load register operation on x86 target using GCC.

However, on an ARM target using GCC, it translates into a needlessly long assembly sequence, which hurts performance.

Using Clang on ARM target, memcpy works fine (see @notlikethat comment below). It would be easy to blame GCC at large, but it's not that simple : the memcpy solution works fine on GCC with x86/x64, PPC and ARM64 targets. Lastly, trying another compiler, icc13, the memcpy version is surprisingly heavier on x86/x64 (4 instructions, while one should be enough). And that's just the combinations I could test so far.

I have to thank the godbolt project for making such observations easy.

The second solution is to use __packed structures. This solution is not standard C and depends entirely on compiler extensions. As a consequence, the way to write it depends on the compiler, and sometimes on its version. This is a mess for maintenance of portable code.

That being said, in most circumstances, it leads to better code generation than memcpy. In most circumstances only ...
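As a minimal sketch of the packed approach, assuming the GCC/Clang `__attribute__((packed))` spelling (other compilers spell it differently, which is exactly the maintenance problem mentioned above). The names `unalign32` and `read32_packed` are hypothetical, chosen for illustration.

```c
#include <stdint.h>

/* GCC/Clang spelling of a packed struct: the member has alignment 1,
 * so the compiler must not assume a 4-byte aligned address. */
typedef struct __attribute__((packed)) { uint32_t v; } unalign32;

/* Reads 32 bits from a possibly unaligned address (hypothetical helper). */
uint32_t read32_packed(const void *ptr)
{
    return ((const unalign32 *)ptr)->v;
}
```

Reading from an odd offset in a byte buffer, e.g. `read32_packed(buf + 1)`, should return the same value as the memcpy version on the same machine.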

For example, regarding the above cases where the memcpy solution does not perform well, here are the findings :

on x86 with ICC : __packed solution works
on ARMv7 with GCC : __packed solution works
on ARMv6 with GCC : does not work. Assembly looks even uglier than memcpy.

The last solution is to use direct u32 access to unaligned memory positions. This solution used to work for decades on x86 CPUs, but is not recommended, as it violates the C standard : dereferencing a misaligned u32 pointer is undefined behavior, so the compiler is allowed to treat the access as a guarantee that the data is properly aligned, which can lead to buggy code generation.

Unfortunately, in at least one case, namely GCC on ARMv6, it is the only solution able to extract performance from the target.

Do not use this solution for ARMv7 though : GCC can generate instructions which are reserved for aligned memory accesses, namely LDM (Load Multiple), leading to a crash.

Even on x86/x64, it becomes dangerous to write your code this way nowadays, as the new generation compilers may try to auto-vectorize some compatible loops, generating SSE/AVX code based on the assumption that these memory positions are properly aligned, crashing the program.

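As a recap, here are the results summarized as a table. Each cell names the method that generates good code for that compiler and target, picking the first one that works in the order of preference memcpy > packed > direct.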

| compiler  | x86/x64 | ARMv7  | ARMv6  | ARM64  |  PPC   |
|-----------|---------|--------|--------|--------|--------|
| GCC 4.8   | memcpy  | packed | direct | memcpy | memcpy |
| clang 3.6 | memcpy  | memcpy | memcpy | memcpy |   ?    |
| icc 13    | packed  | N/A    | N/A    | N/A    | N/A    |
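One way to act on this table is a compile-time selector. The sketch below is only an illustration under stated assumptions: the macro `READ32_METHOD`, its numbering, and the helper `read32_method_name` are hypothetical, and the conditions mirror the specific compiler versions measured above, so other versions may behave differently. It relies on the predefined macros `__INTEL_COMPILER`, `__GNUC__`, `__clang__`, `__arm__` and the ACLE macro `__ARM_ARCH`.

```c
/* Hypothetical numbering: 0 = memcpy, 1 = packed struct, 2 = direct access. */
#ifndef READ32_METHOD
#  if defined(__INTEL_COMPILER)
#    define READ32_METHOD 1   /* icc on x86/x64: packed */
#  elif defined(__GNUC__) && !defined(__clang__) && defined(__arm__) && (__ARM_ARCH == 6)
#    define READ32_METHOD 2   /* GCC on ARMv6: direct */
#  elif defined(__GNUC__) && !defined(__clang__) && defined(__arm__) && (__ARM_ARCH == 7)
#    define READ32_METHOD 1   /* GCC on ARMv7: packed */
#  else
#    define READ32_METHOD 0   /* everything else: memcpy */
#  endif
#endif

/* Reports which method the selector picked (hypothetical helper). */
const char *read32_method_name(void)
{
#if READ32_METHOD == 2
    return "direct";
#elif READ32_METHOD == 1
    return "packed";
#else
    return "memcpy";
#endif
}
```

Wrapping the selection in `#ifndef` lets a build override it, e.g. `-DREAD32_METHOD=0` to force the portable memcpy path. The Intel check comes first because icc also defines `__GNUC__`.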
