Electronic – Which microcontroller for a program with many floating point operations


I've been using STM32 microcontrollers for quite a long time, from the F1, F2, F3 and F4 to the F7. In one application I changed from the F4 (100 MHz) to the F7 (200 MHz), but this seems to have been a mistake.

The application ran at around 15 kHz on the F4 and at around 12 kHz on the F7, although the F7 runs at double the clock speed. So it seems that the two processors have different FPU architectures and, from what I have read, the F4 has some parallelism in the FPU while the F7 can only execute operations sequentially.

So is it true that, for an application with a heavy FPU load, an F4 outperforms an F7?


So I made some measurements on real hardware to verify my thoughts:

Hardware: STM32F722RC vs STM32F412CE

Program: Just some FPU operations as used in my application

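  /* 2000 iterations of single-precision FPU work */
  for (uint16_t i = 0; i < 2000; i++) {
      if (x > 6) {
          x = 0.1f;
      } else if (x < -6) {
          x = -0.1f;
      }
      x = x + 0.05f;
      x = x + sinf(x) * cosf(x);
  }

  /* Timer ticks at 1 MHz, so CNT / 1e6 gives seconds */
  cyclic_time[ptr] = htim6.Instance->CNT;
  cyclic_time[ptr] /= 1e6f;
  if (ptr >= 20) {
      ptr = 0;
  }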

Performance F4 at 100 MHz:
[screenshot]

-> So an average cycle time of around 7.365 ms

Performance F7:

At 200 MHz:
[screenshot]

At 100 MHz:
[screenshot]

-> So an average cycle time of around 9.954 ms at 200 MHz in the best case. (I verified that in both cases the timer runs at the correct clock speed, 100 MHz with a prescaler of 99, so the measurement is correct.)

So that is exactly what I observed in my real application. Somehow the F4 outperforms the F7 when it comes to floating-point operations.
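To compare the two measurements independently of clock speed, it can help to convert them into CPU cycles per loop iteration. The following is only a sketch: the helper name `cycles_per_iteration` is made up for illustration, and it assumes that the measured time covers just the 2000 loop iterations and nothing else.

```c
#include <stdint.h>

/* Hypothetical helper: converts a measured loop time into CPU cycles per
 * loop iteration. Assumes the measured time covers only the loop itself. */
static double cycles_per_iteration(double time_ms, double core_clock_mhz,
                                   uint32_t iterations)
{
    /* 1 ms at 1 MHz is 1000 cycles, so ms * MHz * 1000 = total cycles. */
    double total_cycles = time_ms * core_clock_mhz * 1000.0;
    return total_cycles / (double)iterations;
}
```

With the numbers above, `cycles_per_iteration(7.365, 100.0, 2000)` gives roughly 368 cycles per iteration on the F4, and `cycles_per_iteration(9.954, 200.0, 2000)` gives roughly 995 on the F7. Seen this way, the F7 spends more than twice as many cycles on the same loop body.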


Compiler options for the F4 and the F7:
[two screenshots]

To make sure that the optimizer is not the problem, I tested the cycle time with the optimizer set to optimize for speed:

[screenshot]
-> Around 6.44 ms

[screenshot]
-> Around 8.344 ms

So this leads to the same problem.

Projects for F4 and F7:

Best Answer

Hm, until you do a bit of benchmarking to show that it's really the FPU, I'd heavily doubt this is about the Cortex-M7F FPU being slower (it's really not, never seen that).

Generally, try to make sure you're not inadvertently doing something like soft floating-point math (-mfloat-abi=soft), and aren't using math libraries that have been optimized for the STM32F4 but not the F7. Make sure you're compiling for ARMv7-M or ARMv7E-M.
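A quick way to check which floating-point setup a build actually targets is to look at the macros the compiler predefines. The sketch below assumes GCC or Clang for ARM, where `__SOFTFP__` is defined for `-mfloat-abi=soft` and `__ARM_PCS_VFP` for the hard-float calling convention. The function name `float_abi_description` is made up for illustration.

```c
#include <stdio.h>

/* Hypothetical helper: describes the floating-point configuration this
 * translation unit was compiled for, based on predefined macros
 * (assumes GCC or Clang; other compilers may use different macros). */
static const char *float_abi_description(void)
{
#if !defined(__arm__)
    return "not a 32-bit ARM target (host build)";
#elif defined(__SOFTFP__)
    return "soft-float: floating-point math is done in software";
#elif defined(__ARM_PCS_VFP)
    return "hard-float ABI: FPU instructions, FPU registers for arguments";
#else
    return "softfp ABI: FPU instructions, integer registers for arguments";
#endif
}
```

Calling `puts(float_abi_description());` once at startup (or checking the same macros with `#error` at compile time) should show quickly whether the F7 project was accidentally built without FPU support.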

The fact that you're putting a processing rate on this sounds like a DSP workload. So, make sure you're really using the DSP instructions: both the M4 and the M7 should have single-cycle multiply-accumulates, so a 200 MHz M7 should in any case be twice as fast as a 100 MHz M4 if these are used. Your compiler should infer these, but sometimes a bit of hand-written assembly pays off.
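As an illustration of the kind of code a compiler can turn into multiply-accumulate instructions, here is a minimal single-precision dot product. It is a sketch with a made-up function name (`dot_product_f32`). Whether the loop body actually becomes a multiply-accumulate instruction depends on the compiler, the FPU flags and the optimization level, so it is worth checking the disassembly.

```c
#include <stddef.h>

/* Multiply-accumulate loop: acc += a[i] * b[i].
 * With FPU code generation enabled, a compiler can map the loop body
 * to a floating-point multiply-accumulate instruction. */
static float dot_product_f32(const float *a, const float *b, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        acc += a[i] * b[i];
    }
    return acc;
}
```

Compiling this once for each chip and comparing the generated assembly side by side is a cheap way to see whether both builds use the same instructions.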

So, either you're using a compiler that is too old or that is not set up to use the FPU and DSP instructions sensibly, or something else is going on here.

From a general DSP engineering perspective: often, there is a lot to fix in terms of algorithmic or programming inefficiencies before specific properties of FPUs become relevant for application performance. Your 100 MHz Cortex-M4F is already a pretty strong processor, so 15 kS/s of throughput does sound like a pretty hefty DSP workload (roughly 6,700 CPU cycles per sample!), and it might really make sense to ask a question on the DSP StackExchange sister site, describing the algorithm you're running and asking specifically how to implement it.