STM32F4: Floating point instructions too slow

Tags: clock-speed, floating-point, nucleo, stm32, stm32f4

I'm working on an audio application on the Nucleo F411RE and I've noticed that my processing was too slow, making the application skip some samples.

Looking at my disassembly, I figured that given the number of instructions and the 100 MHz CPU clock (which I set in STM32CubeMX), it should be a lot faster.

I checked the SYSCLK value and it is 100 MHz as expected. To be 100% sure, I put 1000 "nop" instructions in my main loop and measured 10 µs, which does correspond to a 100 MHz clock.
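The arithmetic behind that check can be sketched as below. This is only an illustration: the helper name `cycles_to_ns` is hypothetical, and it assumes each nop costs exactly one cycle, which the core does not strictly guarantee.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: time in nanoseconds taken by `cycles` CPU cycles
 * at a core clock of `f_mhz` MHz. Integer math avoids rounding surprises. */
static uint32_t cycles_to_ns(uint32_t cycles, uint32_t f_mhz)
{
    assert(f_mhz > 0);                      /* a 0 MHz clock is meaningless */
    return (uint32_t)(((uint64_t)cycles * 1000u) / f_mhz);
}
```

With these assumptions, `cycles_to_ns(1000, 100)` gives 10000 ns, i.e. the 10 µs measured for 1000 nops at 100 MHz.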

I measured exactly the time taken by my processing and it takes 14.5 µs, i.e. 1450 clock cycles. I think that's way too much, considering that the processing is pretty simple:

for(i=0; i<12; i++)
{
    el1.osc[i].phase += el1.osc[i].phaseInc;  // 16 µs
    if(el1.osc[i].phase >= 1.0) // 20 µs (for the whole "if")
        el1.osc[i].phase -= 1.0; 
    el1.osc[i].value = sine[ (int16_t)(el1.osc[i].phase * RES) ]; // 96 µs
    el1.val += el1.osc[i].value * el1.osc[i].amp; // 28 µs
} // that's a total of 1.63 µs for the whole loop

where phase and phaseInc are single precision floats, and value is an int16_t, sine[] is a look up table containing 1024 int16_t.

It shouldn't be more than like 500 cycles, right? I looked at the disassembly, it does use the floating point instructions…
For example, the last line disassembly is :
vfma.f32 => 3 cycles
vcvt.s32.f32 => 1 cycle
vstr => 2 cycles
ldrh.w => 2 cycles

(cycle timings according to this) So that's a total of 8 cycles for that line, which is the "biggest".
I don't really get why it's so slow… Maybe because I'm using structures or something?

If anybody has an idea, I'd be glad to hear it.

EDIT: I just measured the time line by line; you can see it in the code above. It seems like the most time-consuming line is the look-up table line, which would mean that memory access time is the critical part? How could I improve that?

EDIT2: disassembly, as requested by BruceAbott (sorry it's a bit messy, probably because of the way it was optimized by the compiler):

membrane1.mode[i].phase += membrane1.mode[i].phaseInc;
0800192e:   vldr s14, [r5, #12]
08001932:   vldr s15, [r5, #8]
08001936:   vadd.f32 s15, s15, s14
0800193a:   adds r5, #24
if(membrane1.mode[i].phase >= 1.0)
0800193c:   vcmpe.f32 s15, s16
08001940:   vmrs APSR_nzcv, fpscr
membrane1.mode[i].phase -= 1.0;
08001944:   itt ge
08001946:   vmovge.f32 s14, #112    ; 0x70
0800194a:   vsubge.f32 s15, s15, s14
0800194e:   vstr s15, [r5, #-16]
membrane1.mode[i].value = sine[(int16_t)(membrane1.mode[i].phase * RES)];
08001952:   ldr.w r0, [r5, #-16]
08001956:   bl 0x80004bc <__extendsfdf2>
0800195a:   ldr r3, [pc, #112]      ; (0x80019cc <main+428>)
0800195c:   movs r2, #0
0800195e:   bl 0x8000564 <__muldf3>
08001962:   bl 0x8000988 <__fixdfsi>
08001966:   ldr r3, [pc, #104]      ; (0x80019d0 <main+432>)
membrane1.val += membrane1.mode[i].value * membrane1.mode[i].amp;
08001968:   vldr s13, [r5, #-4]
membrane1.mode[i].value = sine[(int16_t)(membrane1.mode[i].phase * RES)];
0800196c:   sxth r0, r0
0800196e:   ldrh.w r3, [r3, r0, lsl #1]
08001972:   strh.w r3, [r5, #-8]
membrane1.val += membrane1.mode[i].value * membrane1.mode[i].amp;
08001976:   sxth r3, r3
08001978:   vmov s15, r3
0800197c:   sxth r3, r4
0800197e:   vcvt.f32.s32 s14, s15
08001982:   vmov s15, r3
08001986:   vcvt.f32.s32 s15, s15
for(i=0; i<12; i++) // VADD.F32 : 1 cycle
0800198a:   cmp r5, r6
membrane1.val += membrane1.mode[i].value * membrane1.mode[i].amp;
0800198c:   vfma.f32 s15, s14, s13
08001990:   vcvt.s32.f32 s15, s15
08001994:   vstr s15, [sp, #4]
08001998:   ldrh.w r4, [sp, #4]
0800199c:   bne.n 0x800192e <main+270>

Best Answer

In your disassembly we see calls to 64-bit (double-precision) math functions:

08001956:   bl 0x80004bc <__extendsfdf2>
...
0800195e:   bl 0x8000564 <__muldf3>
08001962:   bl 0x8000988 <__fixdfsi>

The STM32F4's FPU only supports 32-bit (single-precision) floating point in hardware, so these double-precision operations are performed by software library routines and take many cycles each. To ensure that all calculations are done in 32 bits, define all your floating-point values, including constants, as type float (for example, write 1.0f rather than 1.0, which C treats as a double).
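As a rough sketch of what that looks like in code: the library calls appear in the look-up line, which suggests RES is defined as a double constant. The definition of RES is not shown in the question, so the value below (`RES_F`, set to the 1024-entry table length) and the helper names `wrap_phase` and `phase_to_index` are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdint.h>

#define TABLE_SIZE 1024       /* the question says sine[] holds 1024 entries */
#define RES_F      1024.0f    /* ASSUMED value of RES; the 'f' suffix makes it a float */

/* Wrap a phase that has just passed 1.0 back into [0, 1),
 * using float constants only so no double arithmetic is generated. */
static float wrap_phase(float phase)
{
    assert(phase >= 0.0f);    /* sketch only handles non-negative phases */
    if (phase >= 1.0f)        /* 1.0f is a float; a plain 1.0 would be a double */
        phase -= 1.0f;
    return phase;
}

/* Convert a phase in [0, 1) to a table index.
 * float * float stays in single precision, so the FPU can do it. */
static int16_t phase_to_index(float phase)
{
    int16_t idx = (int16_t)(phase * RES_F);
    if (idx >= TABLE_SIZE)    /* defensive clamp in case phase reaches 1.0 */
        idx = TABLE_SIZE - 1;
    return idx;
}
```

If the toolchain is GCC (the `__extendsfdf2` style names come from its runtime library), the `-Wdouble-promotion` warning flags implicit float-to-double promotions, and `-fsingle-precision-constant` treats unsuffixed constants as float. Whether either option suits a given project is a judgment call.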