Electronic – Confused about the XIP (eXecute In Place) function of QSPI FLASH

Many NOR QSPI flash chips support XIP (eXecute In Place). In this mode the embedded CPU (or MCU) can execute code directly from the flash. But a QSPI flash can only output 4 bits of data per clock cycle (8 bits per cycle in DDR mode), while many MCUs, such as the ARM Cortex-M series, need a 32-bit instruction per cycle. So the MCU has to wait at least 8 flash clock cycles to get a valid instruction, which seems very slow. Besides, the maximum clock frequency of a NOR QSPI flash chip is often below 130 MHz, while an MCU may run at 200 MHz or higher, which means an even longer delay before the CPU receives a valid instruction from the external flash.
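To put rough numbers on this, here is a minimal sketch of the arithmetic. The frame layout (optional opcode, 24-bit address, dummy clocks, then data) and the figure of 6 dummy clocks are assumptions for illustration; real values depend on the flash part and its configured read mode. `qspi_frame`, `qspi_read_clocks` and `cpu_stall_cycles` are hypothetical names, not part of any vendor library.

```c
#include <assert.h>

/* Assumed read frame: [opcode][address][dummy][data]. All values are
 * illustrative; take the real ones from the flash datasheet. */
typedef struct {
    unsigned cmd_clocks;   /* 8 for an opcode sent on one line, 0 if the
                              flash stays in a continuous-read mode */
    unsigned addr_bits;    /* 24 assumed */
    unsigned dummy_clocks; /* depends on device and clock frequency */
    unsigned lines;        /* 4 for quad I/O */
    unsigned ddr;          /* nonzero: two transfers per clock */
} qspi_frame;

/* Flash clocks needed for one read of data_bytes bytes. */
unsigned qspi_read_clocks(const qspi_frame *f, unsigned data_bytes)
{
    assert(f->lines > 0);
    unsigned bits_per_clock = f->lines * (f->ddr ? 2u : 1u);
    /* Round up: a partially used clock still costs a full clock. */
    unsigned addr_clocks = (f->addr_bits + bits_per_clock - 1u) / bits_per_clock;
    unsigned data_clocks = (data_bytes * 8u + bits_per_clock - 1u) / bits_per_clock;
    return f->cmd_clocks + addr_clocks + f->dummy_clocks + data_clocks;
}

/* CPU cycles that elapse during flash_clocks flash clocks (rounded up). */
unsigned cpu_stall_cycles(unsigned flash_clocks, unsigned cpu_mhz, unsigned flash_mhz)
{
    assert(flash_mhz > 0);
    return (flash_clocks * cpu_mhz + flash_mhz - 1u) / flash_mhz;
}
```

Under these assumptions, a single random 32-bit fetch in SDR quad mode costs 6 (address) + 6 (dummy) + 8 (data) = 20 flash clocks rather than 8, which is about 31 CPU cycles with a 200 MHz core and a 130 MHz flash clock.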

Although we can write a bootloader that copies the user application code and data into instruction/data RAM after power-on, I think this operation would cost a lot of time. Besides, if the RAM in the MCU is not big enough to hold the code and data stored in the flash, this becomes difficult. I know that we could add I/D caches to the MCU, but this also means more complexity. For example, the STM32F479XX doesn't have a cache, but it has a QSPI interface which supports XIP.

I don't know if my understanding is wrong, but I really couldn't find many details about XIP. The Technical Reference Manuals of the STM32Fxxx parts only say that they have a QSPI interface and support XIP, but they don't give any details. Therefore, I guess we also need to implement a very complicated QSPI controller in the MCU to support XIP.

Can anyone give me some answers to this question? Are there any books or websites that explain how to design a QSPI interface that supports XIP?

Best Answer

You are on the right track: XIP is generally slow, which is why it is really only used for first-stage bootloaders and severely RAM-constrained devices.

The big advantage of booting from flash is that a read-only flash interface has very few configuration parameters, so the controller can start up with a simple, conservative flash interface. The RAM parameters are then stored in that same flash, in the form of instructions that set up the RAM controller registers.

This gives the maximum amount of flexibility to the board designers, because the MCU hardware does not make any assumptions beyond the flash chip.

How much code you actually run from flash vs copy to RAM first is a trade-off, affected by how often the code is run, how much of it fits into caches and how predictable execution time needs to be.
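As a rough illustration of the "copy to RAM first" side of that trade-off, the sketch below shows the kind of word-by-word copy a startup routine might perform. On a real target the source and destination would typically come from linker-script symbols (their names vary by toolchain); here plain pointers stand in for them, and `copy_to_ram` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy `words` 32-bit words from the load address (memory-mapped flash)
 * to the run address (RAM). The two regions are assumed not to overlap. */
void copy_to_ram(uint32_t *ram_dst, const uint32_t *flash_src, size_t words)
{
    assert(ram_dst != NULL && flash_src != NULL);
    for (size_t i = 0; i < words; i++) {
        /* Each read here goes through the slow flash interface once;
         * later executions of the copied code hit RAM instead. */
        ram_dst[i] = flash_src[i];
    }
}
```

The copy pays the flash read cost once per word at boot, so it tends to be worthwhile for code that runs often or needs predictable timing, and less so for code that runs once.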

In CPUs that have caches, an entire cache line is typically loaded from flash at once, which also takes care of the bus width problem nicely.
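A back-of-the-envelope sketch of why line fills help: the fixed per-transaction overhead (address and dummy clocks) is paid once per burst rather than once per word. The overhead of 12 clocks and the 32-byte line size below are assumed values, and `clocks_per_word_x100` is a hypothetical helper.

```c
#include <assert.h>

/* Average flash clocks per 32-bit word for one burst read, scaled by 100
 * so the result stays in integer arithmetic (950 means 9.5 clocks). */
unsigned clocks_per_word_x100(unsigned overhead_clocks, unsigned burst_bytes,
                              unsigned lines)
{
    /* Assumes SDR mode and a burst that is a whole number of words. */
    assert(lines > 0 && burst_bytes >= 4u && burst_bytes % 4u == 0u);
    assert((burst_bytes * 8u) % lines == 0u);
    unsigned data_clocks = burst_bytes * 8u / lines;
    unsigned words = burst_bytes / 4u;
    return (overhead_clocks + data_clocks) * 100u / words;
}
```

With 12 overhead clocks on a quad interface, a single-word read averages 20 clocks per word, while a 32-byte line fill averages 9.5 clocks per word, approaching the 8-clock floor set by the bus width as the burst gets longer.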
