Cycles per instruction in delay loop on arm

armassemblyc

I'm trying to understand some assembler generated for the stm32f103 chipset by arm-none-eabi-gcc, which seems to be running exactly half the speed I expect. I'm not that familiar with assembler but since everyone always says read the asm if you want to understand what your compiler is doing I am seeing how far I get. Its a simple function:

void delay(volatile uint32_t num) { 
    volatile uint32_t index = 0; 
    for(index = (6000 * num); index != 0; index--) {} 
}

The clock speed is 72MHz and the above function gives me a 1ms delay, but I expect 0.5ms (since (6000*6)/72000000 = 0.0005).

The assembler is this:

delay:
        @ args = 0, pretend = 0, frame = 16
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        sub     sp, sp, #16         stack pointer = stack pointer - 16
        movs    r3, #0              move 0 into r3 and update condition flags
        str     r0, [sp, #4]        store r0 at location stack pointer+4
        str     r3, [sp, #12]       store r3 at location stack pointer+12 
        ldr     r3, [sp, #4]        load r3 with data at location stack pointer+4 
        movw    r2, #6000           move 6000 into r2 (make r2 6000)
        mul     r3, r2, r3          r3 = r2 * r3
        str     r3, [sp, #12]       store r3 at stack pointer+12
        ldr     r3, [sp, #12]       load r3 with data at stack pointer+12
        cbz     r3, .L1             Compare and Branch on Zero
.L4:
        ldr     r3, [sp, #12]   2   load r3 with data at location stack pointer+12
        subs    r3, r3, #1      1   subtract 1 from r3 with 'set APSR flag' if any conditions met
        str     r3, [sp, #12]   2   store r3 at location sp+12 
        ldr     r3, [sp, #12]   2   load r3 with data at location sp+12
        cmp     r3, #0          1   status = 0 - r3 (if r3 is 0, set status flag)
        bne     .L4             1   branch to .L4 if not equal
.L1:
        add     sp, sp, #16         add 16 back to the stack pointer
        @ sp needed
        bx      lr
        .size   delay, .-delay
        .align  2
        .global blink
        .thumb
        .thumb_func
        .type   blink, %function

I've commented what I believe each instruction means from looking it up. So I believe the .L4 section is the loop of the delay function, which is 6 instructions long. I do realise that clock cycles are not always the same as instructions but since theres such a large difference, and since this is a loop which I imagine is predicted and pipelined efficiently, I am wondering if theres a solid reason that I am seeing 2 clock cycles per instruction.

Background:
In the project I am working on I need to use 5 output pins to control a linear ccd, and the timing requirements are said to be fairly tight. Absolute frequency will not be maxed out (I will clock the pins slower than the cpu is capable of) but pin timings relative to each other are important. So rather than use interupts which are at the limit of my ability and might complicate relative timings I am thinking use loops to provide the short delays (around 100 ns) between pin voltage change events, or even code the whole section in unrolled assembler since I have plenty of program storage space. There is a period when the pins are not changing during which I can run the ADC to sample the signal.

Although the odd behaviour I am asking about is not a show stopper I would rather understand it before proceeding.

Edit: From comment, the arm tech ref gives instruction timings. I have added them to the assembly. But its still only a total of 9 cycles rather than the 12 I expect. Is the jump a cycle itself?

TIA, Pete

Think I have to give this one to ElderBug although Dwelch raised some points which might also be very relevant so thanks to all. Going from this I will try using unrolled assembly to toggle the pins which are 20ns apart in their changes and then return back to C for the longer waits, and ADC conversion, then back to assembly to repeat the process, keeping an eye on the assembly output from gcc to get a rough idea of whether my timings look OK. BTW Elder the modified wait_cycles function does work as expected as you said. Thanks again.

Best Answer

First, doing a spin-wait loop in C is a bad idea. Here I can see that you compiled with -O0 (no optimizations), and your wait will be much shorter if you enable optimizations (EDIT: Actually maybe the unoptimized code you posted just results from the volatile, but it doesn't really matter). C wait loops are not reliable. I maintained a program that relied on a function like that, and each time we had to change a compiler flag, the timings were messed (fortunately, there was a buzzer that went out of tune as a result, reminding us to change the wait loop).

About why you don't see 1 instruction per cycle, it is because some instructions don't take 1 cycle. For example, bne can take additional cycles if the branch is taken. The problem is that you can have less deterministic factors, like bus usage. Accessing the RAM means using the bus, that can be busy fetching data from ROM or in use by a DMA. This means instructions like STR and LDR may be delayed. On your example, you have a STR followed by a LDR on the same location (typical of -O0); if the MCU doesn't have store-to-load forwarding, you can have a delay.

What I do for timings is using a hardware timer for delay above 1µs, and a hard-coded assembly loop for the really short delays.

For the hardware timer, you just have to setup a timer at a fixed frequency (with period < 1µs if you want delay accurate at 1µs), and use some simple code like that :

void wait_us( uint32_t us ) {
    uint32_t mark = GET_TIMER();
    us *= TIMER_FREQ/1000000;
    while( us > GET_TIMER() - mark );
}

You can even use mark as a parameter to set it before some task, and use the function to wait for the remaining time after. Example :

uint32_t mark = GET_TIMER();
some_task();
wait_us( mark, 200 );

For the assembly wait, I use this one for ARM Cortex-M4 (close to yours) :

#define CYCLES_PER_LOOP 3
inline void wait_cycles( uint32_t n ) {
    uint32_t l = n/CYCLES_PER_LOOP;
    asm volatile( "0:" "SUBS %[count], 1;" "BNE 0b;" :[count]"+r"(l) );
}

This is very short, precise, and won't be affected by compiler flags nor bus load. You may have to tune the CYCLES_PER_LOOP, but I think it will the same value for your MCU (here it is 1+2 for SUBS+BNE).

Related Solutions

What’s the purpose of the LEA instruction

As others have pointed out, LEA (load effective address) is often used as a "trick" to do certain computations, but that's not its primary purpose. The x86 instruction set was designed to support high-level languages like Pascal and C, where arrays—especially arrays of ints or small structs—are common. Consider, for example, a struct representing (x, y) coordinates:

struct Point
{
     int xcoord;
     int ycoord;
};

Now imagine a statement like:

int y = points[i].ycoord;

where points[] is an array of Point. Assuming the base of the array is already in EBX, and variable i is in EAX, and xcoord and ycoord are each 32 bits (so ycoord is at offset 4 bytes in the struct), this statement can be compiled to:

MOV EDX, [EBX + 8*EAX + 4]    ; right side is "effective address"

which will land y in EDX. The scale factor of 8 is because each Point is 8 bytes in size. Now consider the same expression used with the "address of" operator &:

int *p = &points[i].ycoord;

In this case, you don't want the value of ycoord, but its address. That's where LEA (load effective address) comes in. Instead of a MOV, the compiler can generate

LEA ESI, [EBX + 8*EAX + 4]

which will load the address in ESI.

Sqlite – Improve INSERT-per-second performance of SQLite

Several tips:

Put inserts/updates in a transaction.
For older versions of SQLite - Consider a less paranoid journal mode (pragma journal_mode). There is NORMAL, and then there is OFF, which can significantly increase insert speed if you're not too worried about the database possibly getting corrupted if the OS crashes. If your application crashes the data should be fine. Note that in newer versions, the OFF/MEMORY settings are not safe for application level crashes.
Playing with page sizes makes a difference as well (PRAGMA page_size). Having larger page sizes can make reads and writes go a bit faster as larger pages are held in memory. Note that more memory will be used for your database.
If you have indices, consider calling CREATE INDEX after doing all your inserts. This is significantly faster than creating the index and then doing your inserts.
You have to be quite careful if you have concurrent access to SQLite, as the whole database is locked when writes are done, and although multiple readers are possible, writes will be locked out. This has been improved somewhat with the addition of a WAL in newer SQLite versions.
Take advantage of saving space...smaller databases go faster. For instance, if you have key value pairs, try making the key an INTEGER PRIMARY KEY if possible, which will replace the implied unique row number column in the table.
If you are using multiple threads, you can try using the shared page cache, which will allow loaded pages to be shared between threads, which can avoid expensive I/O calls.
Don't use !feof(file)!

I've also asked similar questions here and here.

Best Answer

Related Solutions

What’s the purpose of the LEA instruction

Sqlite – Improve INSERT-per-second performance of SQLite

Related Topic