ARM Cortex-M4 – VFMA Performance and Clock Cycles Explained

arm, assembly, cortex-m, cortex-m4, dsp

I am working on some performance-critical DSP code destined to run on an ARM Cortex-M4. One particular section of the code (a sinc interpolation function) is dense with multiply-accumulate operations and I am trying to ensure that performance is as good as possible so we can clock the MCU slower and save power.

Now, I have inspected the code emitted by arm-none-eabi-gcc for my interpolation function, and it was not performant enough, so I have unrolled and rewritten the inner loop in assembly as a string of VFMA fused multiply-accumulate instructions, like so:

VFMA.F32  S8, S24, S16    @ S8 += S24 * S16; the running sum stays in S8 throughout
VFMA.F32  S8, S25, S17
VFMA.F32  S8, S26, S18
VFMA.F32  S8, S27, S19
VFMA.F32  S8, S28, S20
VFMA.F32  S8, S29, S21
VFMA.F32  S8, S30, S22
VFMA.F32  S8, S31, S23

To my surprise, however, the Cortex-M4 Technical Reference Manual says something strange about the performance of the VFMA instruction. While properly scheduled VADD and VMUL operations each take a single clock cycle, the CM4 TRM says that VFMA takes three clock cycles! On that basis, one would conclude that the fastest unrolled loop should consist of interleaved VMUL and VADD instructions, rather than half as many VFMA instructions.
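For what it's worth, this is roughly the VMUL/VADD interleaving I had in mind (only a sketch: it assumes S0-S3 are free to hold intermediate products, which is not the register allocation of my real loop). Each product is consumed two instructions after it is produced, so no result feeds the instruction immediately following it:

VMUL.F32  S0, S24, S16
VMUL.F32  S1, S25, S17
VADD.F32  S8, S8, S0    @ S0 was produced two instructions earlier
VMUL.F32  S2, S26, S18
VADD.F32  S8, S8, S1
VMUL.F32  S3, S27, S19
VADD.F32  S8, S8, S2
@ ... continue the pattern for the remaining taps, then drain the
@ last pending products into S8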

There has been some discussion online about this issue, but the information is sparse and inconsistent. Some say that VFMA is aimed at code-size reduction rather than speed and that 3 cycles is normal. Others report observing a 2-cycle execution time in a long unrolled loop, contrary to the CM4 TRM. One revision of the TRM says that when multiple VFMA operations execute back to back, the results are forwarded and the execution time is only 1 clock cycle. Still others suggest that many of the slower VFMA measurements posted online suffer from additional slowdown due to flash wait states or an improperly configured prefetch engine.

Can anyone shed some light on what factors influence the timing of VFMA on the Cortex-M4?

Best Answer

In the TRM, there is the statement:

Floating-point arithmetic data processing instructions, such as add, subtract, multiply, divide, square-root, all forms of multiply with accumulate, as well as conversions of all types take one cycle longer if their result is consumed by the following instruction.

A back-to-back sequence of VMUL, VADD will take the same number of cycles (3) as an isolated VFMA. The compiler can reorder instructions to remove this hazard, so a sequence of VMUL, [..], VADD will always perform as well as, or better than, the equivalent VFMA instruction. It is worth bearing in mind, though, that the code size using VFMA is smaller than with the VMUL, VADD pair.
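A rough illustration of that hazard (the register numbers here are arbitrary, not taken from the question):

@ Back to back: the VADD consumes the VMUL result immediately,
@ so the pair costs 3 cycles, the same as an isolated VFMA.
VMUL.F32  S0, S24, S16
VADD.F32  S8, S8, S0

@ Scheduled: an independent instruction sits between producer and
@ consumer, so all three instructions issue in one cycle each.
VMUL.F32  S0, S24, S16
VMUL.F32  S1, S25, S17
VADD.F32  S8, S8, S0

In a long unrolled loop there is almost always an independent multiply available to fill that slot, which is why, going by the TRM numbers, the scheduled VMUL/VADD form can sustain one cycle per instruction.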

Regarding the measurements, it is hard to separate out artifacts due to the vendor's implementation, remembering that the Cortex-M4 is only part of a (much) larger system. In the thousands of pages of documentation it is very easy to miss something like a flash wait state.
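One way to take most of the system out of the picture is to time the kernel with the DWT cycle counter while running it from zero-wait-state RAM. A minimal sketch, assuming the part implements the DWT (the addresses are the standard ARMv7-M ones):

LDR   R0, =0xE000EDFC     @ DEMCR
LDR   R1, [R0]
ORR   R1, R1, #0x01000000 @ set TRCENA (bit 24) to enable the DWT
STR   R1, [R0]
LDR   R0, =0xE0001000     @ DWT_CTRL
LDR   R1, [R0]
ORR   R1, R1, #1          @ CYCCNTENA: start the cycle counter
STR   R1, [R0]
LDR   R0, =0xE0001004     @ DWT_CYCCNT
LDR   R2, [R0]            @ count before the kernel
@ ... unrolled VFMA or VMUL/VADD kernel under test ...
LDR   R3, [R0]            @ count after the kernel
SUB   R3, R3, R2          @ elapsed core cycles

Measuring the VFMA and VMUL/VADD versions with the same harness, from the same memory, removes most of the flash wait-state and prefetch ambiguity in the numbers posted online.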
