What are the .w and .n suffixes added to ARM assembly instructions?

arm assembly

We are inside write_four_registers_and_readback() and the next instruction is the call to delay(10):

004002a4:   b.n 0x4002f4 <write_four_registers_and_readback+172>
 78               delay(10);

From the ARM Instruction Set we learn that b is branch, optionally followed by a two-letter condition mnemonic.

Example:

CMP R1,#0 ; Compare R1 with zero and branch to fred
          ; if R1 was zero, otherwise continue
BEQ fred  ; to next instruction.

But this .n doesn't seem to be included in the table of two-letter mnemonics … frankly, it is not a two-letter mnemonic either. What does it mean?

Furthermore, what does the number 0x4002f4 mean? Is it just the absolute form of the address given in <>? Or something else? Section 4.4.3, Assembler syntax, doesn't seem to explain.

Device is SAM4S and toolchain is arm-none-eabi-gcc.

Best Answer

You have pretty ancient reference material there, but a modern toolchain. When Thumb-2 was introduced, ARM also introduced a new Unified Assembler Language which allows writing code which can be assembled to either ARM or Thumb with (potentially) no modification. Most, if not all, recent toolchains default to UAL (although many still support the legacy syntaxes by some means).

In UAL, the .n and .w suffixes apply to mnemonics which have both a 16-bit Thumb ("narrow") and 32-bit Thumb-2 ("wide") encoding*. If unspecified, the assembler will automatically choose an appropriate encoding, but if a particular encoding is desired (e.g. forcing a 32-bit encoding where the assembler would choose the 16-bit version in order to pad code for cache alignment), then the suffix can be specified.
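To make the narrow/wide distinction concrete, here is a small C sketch of how a disassembler can tell the two apart from the first halfword alone. The function name thumb_is_wide is made up for illustration; the rule it applies (top five bits equal to 0b11101, 0b11110 or 0b11111 mean a 32-bit encoding) is the one used by Thumb-2.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: returns true if `hw` is the first halfword of a
 * 32-bit ("wide") Thumb-2 instruction, false if it is a complete 16-bit
 * ("narrow") instruction. */
bool thumb_is_wide(uint16_t hw)
{
    unsigned top5 = hw >> 11;   /* bits [15:11] of the halfword */
    return top5 == 0x1D || top5 == 0x1E || top5 == 0x1F;
}
```

For instance, 0xBF00 (the 16-bit NOP) is classified as narrow, while 0xF3AF (the first halfword of NOP.W) is classified as wide.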

Disassemblers will generally emit the suffixes to ensure that the output, if passed back through an assembler, will result in the exact same object code.

So, what you have there is a 16-bit Thumb branch instruction to address 0x4002f4, which happens to be 172 bytes past the symbol write_four_registers_and_readback.
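The arithmetic behind that address can be checked with a short C sketch. It assumes the 16-bit unconditional branch encoding (11100 followed by an 11-bit immediate) and the rule that the PC reads as the instruction address plus 4 in Thumb state. The halfword 0xE026 is not shown in the question; it is the value I derived for a b.n at 0x4002a4 that targets 0x4002f4, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: target of a 16-bit Thumb unconditional branch
 * (encoding 11100:imm11) located at address `addr`. */
uint32_t thumb_b_narrow_target(uint32_t addr, uint16_t insn)
{
    assert((insn >> 11) == 0x1C);   /* must be the narrow unconditional B */

    /* The offset is imm11 followed by a zero bit, i.e. a signed 12-bit value. */
    int32_t offset = (int32_t)(insn & 0x7FF) << 1;
    if (offset & 0x800)             /* sign bit of the 12-bit value */
        offset -= 0x1000;

    /* In Thumb state the PC reads as the instruction address plus 4. */
    return addr + 4u + (uint32_t)offset;
}
```

With addr = 0x4002a4 and insn = 0xE026 this gives 0x4002f4. Subtracting the 172 shown in the angle brackets puts the symbol write_four_registers_and_readback at 0x400248.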

* Technically, all ARM encodings are also "wide", and thus may take a .w suffix, but it's entirely redundant in that case.

Related Solutions

What are SP (stack) and LR in ARM

LR is link register used to hold the return address for a function call.

SP is stack pointer. The stack is generally used to hold "automatic" variables and context/parameters across function calls. Conceptually you can think of the "stack" as a place where you "pile" your data. You keep "stacking" one piece of data over the other and the stack pointer tells you how "high" your "stack" of data is. You can remove data from the "top" of the "stack" and make it shorter.
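The "pile of data" picture can be sketched as a toy model in C. This is not how you would access the real stack; the array, the pointer and the function names are all made up for illustration. It assumes the usual ARM convention of a full-descending stack, where SP moves down on a push and points at the last item stored.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_STACK_WORDS 16u

static uint32_t model_mem[MODEL_STACK_WORDS];
/* Empty stack: the model SP sits just past the highest word. */
static uint32_t *model_sp = model_mem + MODEL_STACK_WORDS;

/* Push: move SP down one word, then store (full-descending). */
void model_push(uint32_t value)
{
    assert(model_sp > model_mem);                      /* overflow check */
    *--model_sp = value;
}

/* Pop: read the word SP points at, then move SP back up. */
uint32_t model_pop(void)
{
    assert(model_sp < model_mem + MODEL_STACK_WORDS);  /* underflow check */
    return *model_sp++;
}

/* How "high" the pile currently is, in words. */
unsigned model_depth(void)
{
    return (unsigned)(model_mem + MODEL_STACK_WORDS - model_sp);
}
```

Pushing 1 and then 2 gives a depth of 2; the first pop returns 2, the second returns 1, and the stack is empty again.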

From the ARM architecture reference:

SP, the Stack Pointer

Register R13 is used as a pointer to the active stack.

In Thumb code, most instructions cannot access SP. The only instructions that can access SP are those designed to use SP as a stack pointer. The use of SP for any purpose other than as a stack pointer is deprecated.

Note: Using SP for any purpose other than as a stack pointer is likely to break the requirements of operating systems, debuggers, and other software systems, causing them to malfunction.

LR, the Link Register

Register R14 is used to store the return address from a subroutine. At other times, LR can be used for other purposes.

When a BL or BLX instruction performs a subroutine call, LR is set to the subroutine return address. To perform a subroutine return, copy LR back to the program counter. This is typically done in one of two ways, after entering the subroutine with a BL or BLX instruction:

• Return with a BX LR instruction.

• On subroutine entry, store LR to the stack with an instruction of the form: PUSH {<registers>,LR} and use a matching instruction to return: POP {<registers>,PC} ...
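Why the second form exists is easier to see with a toy model: a nested BL overwrites LR, so a subroutine that calls something else has to save LR first. The C sketch below is purely illustrative. The toy_cpu struct and its functions are invented, the stack pointer is a word index instead of an address, and the Thumb bit in the return address is ignored.

```c
#include <stdint.h>

#define TOY_STACK_WORDS 16u

typedef struct {
    uint32_t pc;                      /* program counter */
    uint32_t lr;                      /* link register (R14) */
    uint32_t sp;                      /* word index into mem, descending */
    uint32_t mem[TOY_STACK_WORDS];    /* stack memory */
} toy_cpu;

/* BL at address `at`: LR gets the address after the 4-byte BL, PC the target. */
void toy_bl(toy_cpu *c, uint32_t at, uint32_t target)
{
    c->lr = at + 4u;
    c->pc = target;
}

/* BX LR: return by copying LR back into PC. */
void toy_bx_lr(toy_cpu *c) { c->pc = c->lr; }

/* PUSH {LR}: save the return address before it can be overwritten. */
void toy_push_lr(toy_cpu *c) { c->mem[--c->sp] = c->lr; }

/* POP {PC}: return by loading the saved address straight into PC. */
void toy_pop_pc(toy_cpu *c) { c->pc = c->mem[c->sp++]; }
```

Starting with sp = TOY_STACK_WORDS, the sequence toy_bl(&c, 0x100, 0x200), toy_push_lr(&c), toy_bl(&c, 0x200, 0x300), toy_bx_lr(&c), toy_pop_pc(&c) first returns to 0x204 and then to 0x104, even though the inner call overwrote LR.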

This link gives an example of a trivial subroutine.

Here is an example of how registers are saved on the stack prior to a call and then popped back to restore their content.

Cycles per instruction in delay loop on arm

First, doing a spin-wait loop in C is a bad idea. Here I can see that you compiled with -O0 (no optimizations), and your wait will be much shorter if you enable optimizations (EDIT: Actually maybe the unoptimized code you posted just results from the volatile, but it doesn't really matter). C wait loops are not reliable. I maintained a program that relied on a function like that, and each time we had to change a compiler flag, the timings were messed up (fortunately, there was a buzzer that went out of tune as a result, reminding us to change the wait loop).

As for why you don't see one instruction per cycle: some instructions take more than one cycle. For example, bne can take additional cycles if the branch is taken. There are also less deterministic factors, like bus usage. Accessing the RAM means using the bus, which can be busy fetching data from ROM or in use by a DMA. This means instructions like STR and LDR may be delayed. In your example, you have a STR followed by a LDR on the same location (typical of -O0); if the MCU doesn't have store-to-load forwarding, you can have a delay.

What I do for timing is use a hardware timer for delays above 1 µs, and a hard-coded assembly loop for the really short delays.

For the hardware timer, you just have to set up a timer at a fixed frequency (with a period < 1 µs if you want delays accurate to 1 µs), and use some simple code like this:

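void wait_us( uint32_t us ) {
    uint32_t mark = GET_TIMER();          /* timer value on entry */
    us *= TIMER_FREQ/1000000;             /* convert microseconds to timer ticks */
    while( us > GET_TIMER() - mark );     /* unsigned subtraction survives counter wrap-around */
}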

You can even pass mark as a parameter, set it before some task, and use the function to wait for the remaining time afterwards. Example:

uint32_t mark = GET_TIMER();
some_task();
wait_us( mark, 200 );
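The two-argument variant is not spelled out above, so here is my guess at what it could look like. To keep the sketch runnable on a PC, GET_TIMER() is replaced by a stand-in that simply counts up on every read; on a real MCU it would read a free-running timer register. TIMER_FREQ = 1 MHz and the fake_timer variable are assumptions for illustration only.

```c
#include <stdint.h>

#define TIMER_FREQ 1000000u   /* assumed: 1 MHz free-running 32-bit timer */

/* Stand-in for the hardware counter (hypothetical): advances on each read. */
static uint32_t fake_timer;
static uint32_t GET_TIMER(void) { return fake_timer++; }

/* Wait until `us` microseconds have passed since `mark` was taken.
 * If some of that time is already gone, only the remainder is waited. */
void wait_us(uint32_t mark, uint32_t us)
{
    uint32_t ticks = us * (TIMER_FREQ / 1000000u);

    /* Unsigned subtraction gives the elapsed ticks even if the counter
     * wrapped around between taking `mark` and now. */
    while ((uint32_t)(GET_TIMER() - mark) < ticks)
        ;
}
```

Because the elapsed time is always computed as a difference, the wait still ends on time when the counter overflows from 0xFFFFFFFF to 0 in the middle of the delay.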

For the assembly wait, I use this one for ARM Cortex-M4 (close to yours):

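#define CYCLES_PER_LOOP 3
static inline void wait_cycles( uint32_t n ) {
    uint32_t l = n/CYCLES_PER_LOOP;
    if( l == 0 ) return;  /* otherwise SUBS would wrap around and loop ~2^32 times */
    asm volatile( "0:" "SUBS %[count], #1;" "BNE 0b;" :[count]"+r"(l) : : "cc" );  /* SUBS changes the flags */
}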

This is very short, precise, and won't be affected by compiler flags or bus load. You may have to tune CYCLES_PER_LOOP, but I think it will be the same value for your MCU (here it is 1+2 for SUBS+BNE).
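To turn a delay in microseconds into an argument for wait_cycles, you need the core clock. A minimal sketch of the arithmetic, assuming a 120 MHz core clock (F_CPU_HZ is a made-up name; substitute whatever your clock setup actually produces) and the 3 cycles per loop from above:

```c
#include <stdint.h>

#define F_CPU_HZ        120000000u  /* assumed core clock, adjust to your setup */
#define CYCLES_PER_LOOP 3u          /* SUBS (1 cycle) + taken BNE (2 cycles) */

/* Core clock cycles in `us` microseconds (F_CPU_HZ must be a whole number of MHz). */
uint32_t us_to_cycles(uint32_t us)
{
    return us * (F_CPU_HZ / 1000000u);
}

/* Loop iterations needed to burn `cycles` cycles, rounded down. */
uint32_t cycles_to_loops(uint32_t cycles)
{
    return cycles / CYCLES_PER_LOOP;
}
```

Under these assumptions, 10 µs is 1200 cycles, or 400 iterations of the loop, so the call would be wait_cycles(us_to_cycles(10)).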
