My "flash vs RAM" speed test doesn't work. Why?

Tags: c, flash, memory, ram

In trying to demonstrate that copying a string from flash memory to RAM takes longer than copying the same string from one array in RAM to another, I ran the following code on the PIC32 USB Starter Kit II (PIC32MX795F512L):

// Global variables:
char b[60000] = "Initialized";
char c[] = "Hello";
int i; // loop counter used in both versions below

// In main():
// Version 1: Copying from flash to RAM:
    while (1) {
        strcpy(b, "Hello");
        for (i = 0; i < 999; i++) {
            strcat(b, "Hello");
        }
        PORTD ^= 1; // Toggle LED
    }

// Version 2: Copying from RAM to RAM:
    while (1) {
        strcpy(b, c);
        for (i = 0; i < 999; i++) {
            strcat(b, c);
        }
        PORTD ^= 1; // Toggle LED
    }

I expected to see the LED blinking faster in Version 2, but instead Version 1 was a lot faster! How could this be?

Could it be that instead of copying information from flash, we are using immediate data hardcoded into MIPS machine language? Maybe I should try to understand the MIPS code.

Thanks in advance!

Best Answer

There are several very different possible explanations. Here are two I thought of:

  1. The CPU clock is running so slowly that flash is as fast as RAM, and because the flash-to-RAM copy uses two independent buses, it may actually have more bandwidth than the RAM-to-RAM copy.

EDIT: I've looked at a Microchip data sheet for the PIC32MX5XX/6XX/7XX family. Table 31-12, "PROGRAM FLASH MEMORY WAIT STATE CHARACTERISTIC", says:

  • 0 wait states: 0 to 30 MHz
  • 1 wait state: 31 to 60 MHz
  • 2 wait states: 61 to 80 MHz

So if the CPU clock is 30 MHz or less, flash memory can keep up with no wait states. I can't find any timing specification for the SRAM, so I assume it has no wait states at any speed. Hence, running at 30 MHz or less, flash should be as fast as SRAM.

Even above that 30 MHz clock speed, the flash wait states might have a much smaller impact than expected because of the 'Prefetch Cache'. This cache has 16 cache lines of 16 bytes each, so if the program loop is under 256 bytes (which is feasible for that loop), then once the Prefetch Cache is loaded the program may execute entirely from the cache, without any further access to flash. Clearly this benefits both loops. However, Version 2 accesses RAM twice, to both read and write the data, whereas Version 1 might only access memory to write to RAM.

Allowing for explanation 2 as well: if the compiler has also 'optimised' the strcat(b, "Hello") so that the Version 1 loop never has to read the source string, then the only memory access in Version 1 is storing bytes into b. That should be significantly faster than copying from RAM to RAM.

  2. The compiler optimises Version 1 like crazy. "Hello" is a constant; it would even fit into two 32-bit registers, so the compiler could turn Version 1 into a very tight loop. Assuming the clocks are configured correctly, some version of this is most likely the explanation.

Optimising Version 1 works especially well for compilers which have proper knowledge of strcpy and strcat. gcc has internal built-in versions of strcpy and strcat, which it can choose to use in appropriate circumstances, and it optimises strcpy and strcat into different instruction sequences for several processors. I believe it might even expand the strcat inline in some cases, but I can't find the reference.

So dump the assembler and have a look. For gcc targeting ARM Cortex-M the tool is arm-none-eabi-objdump; your MIPS toolchain will have an equivalent objdump. It produces a textual listing of your program, usually organised into functions, and with the right options (for example -S -d on an ELF built with debug info) it can intermingle the original C source as comments, making it relatively easy to find the assembler instructions which correspond to your code. (Though be aware that, due to optimisation, this mapping might not be perfect.)

If the data, "Hello" in Version 1, is just loaded into registers and written to RAM in a tight loop, then that should be clear from the assembler dump, even without a deep knowledge of MIPS assembler.
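
For illustration only, here is a hypothetical C-level sketch of the kind of loop an aggressive optimiser could reduce Version 1 to. The names, the little-endian byte values, and the assumption that the compiler tracks the end of b across iterations are all mine, and the real output would of course be MIPS assembler rather than C:

#include <stdint.h>
#include <string.h>

extern char b[60000];

/* Hypothetical sketch: the six bytes of "Hello\0" are held as constants
   (in registers), so the loop only ever writes to RAM and never reads
   the source string from flash at run time. */
void version1_as_the_compiler_might_see_it(void)
{
    const uint32_t head = 0x6c6c6548u; /* "Hell" as a little-endian word */
    const uint16_t tail = 0x006fu;     /* "o" plus the terminating '\0'  */
    char *p = b;
    int i;

    for (i = 0; i < 1000; i++) {       /* strcpy + 999 strcat = 1000 copies */
        memcpy(p, &head, 4);           /* a real compiler would emit word stores */
        memcpy(p + 4, &tail, 2);
        p += 5;                        /* the next pass overwrites the '\0' */
    }
}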

What if you want to do an actual comparison of flash vs RAM?
You could make optimisation hard for the compiler, and try to prevent it optimising the two versions of the loop differently.

One approach would be to store the "Hello" in a variable which you force into flash.

I don't know the mechanism for MIPS, but there is very likely a pragma or way to ask for a variable to be placed into the flash segment of the program by the linker.

For gcc for ARM a variable can be 'decorated' with an annotation:
const uint8_t array[10] __attribute__((section(".eeprom"), used));
This marks the variable so that it will be put into the linker's ".eeprom" section, and the linker script ensures all addresses for that section are in the flash memory address range.

However, you may also need to defeat the compiler optimisations that get applied to a constant string value.
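
As a sketch (assuming a GCC-based PIC32 toolchain such as XC32, where const-qualified data is normally linked into program flash; the names here are mine), something like the following covers both points: the source string lives in flash, and reading its address through a volatile pointer stops the compiler from treating it as a compile-time constant:

#include <string.h>

/* Assumption: on a GCC-based PIC32 toolchain, const data normally ends up
   in program flash; add a section attribute, as in the ARM example above,
   if your linker script needs it spelled out explicitly. */
static const char hello_in_flash[] = "Hello";

/* The volatile pointer forces the compiler to fetch the source address
   (and hence the string) at run time instead of substituting immediates. */
static const char * volatile flash_src = hello_in_flash;

void copy_from_flash(char *dst)
{
    strcpy(dst, flash_src);
}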

Put a general-purpose version of the while loop into a separate function, e.g. myfunc(char *a, char *b). Then call it with two different sets of variables (RAM-to-RAM vs flash-to-RAM). This should normally force the compiler to generate one set of code (the body of myfunc), which will be used for both cases. That would give an 'apples for apples' comparison. However, don't underestimate a compiler's ability to optimise; you might still want to dump the assembler to check that the compiler isn't being too clever.

(I would put a limit on the number of iterations to stop it scribbling on peripherals)
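
Putting those pieces together, here is a minimal sketch of the shared-function approach, again assuming const data ends up in flash on your toolchain; PORTD and the clock setup are as in your original code, and all the names are mine:

#include <string.h>

char b[60000];                      /* destination buffer in RAM */
char hello_ram[] = "Hello";         /* source copied into RAM at startup */
const char hello_flash[] = "Hello"; /* const data normally placed in flash (assumption) */

/* One shared body, so both measurements execute exactly the same code.
   max_len caps the output so strcat cannot run past the end of dst. */
void copy_test(char *dst, const char *src, size_t max_len)
{
    size_t i;

    strcpy(dst, src);
    for (i = 0; i < 999; i++) {
        if (strlen(dst) + strlen(src) + 1 > max_len)
            break;                  /* iteration limit: stop before overflowing dst */
        strcat(dst, src);
    }
}

/* In main():
   while (1) { copy_test(b, hello_ram,   sizeof b); PORTD ^= 1; }   // RAM -> RAM
   while (1) { copy_test(b, hello_flash, sizeof b); PORTD ^= 1; }   // flash -> RAM
*/

If the compiler still specialises or inlines copy_test per call site, marking it __attribute__((noinline)) is one way to keep the comparison honest; dumping the assembler will confirm it.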

All of this is speculation. You will need to provide more information (specifically the initialisation of the CPU clock, buses and buffers, the compiler you are using, and ideally an assembler dump) to get more accurate answers.

However, making the change to a single function running a large for loop, and calling it with two different sets of parameters, might be sufficient to satisfy your requirement.