Electronic – Tiny Code for dumping Flash Memory

assembly, c, embedded, firmware, microcontroller

I'm trying to write the smallest possible code to extract the firmware of Infineon's XMC4500 microcontroller.

The code must fit into a 30-byte buffer, which leaves room for 15 machine instructions when using the 16-bit Thumb instruction set.

Starting with C, my attempt is to dump flash memory through a single GPIO pin (see original question), following this nifty trick.

Basically what I'm doing is:

  1. Set the GPIO pin directions to output
  2. Blink LED1 (pin 1.1) with a clock (SPI serial clock)
  3. Blink LED2 (pin 1.0) with data bits (SPI MOSI)
  4. Sniff pins with a logic analyzer
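
For step 4, the captured samples still have to be turned back into words on the host. The following is a minimal sketch of that decoding, not part of the original setup. It assumes each logic-analyzer sample is one byte with the clock on bit 1 and the data on bit 0, that the capture starts while the clock is low, and that bits arrive least significant first (the firmware shifts the word right after every bit). The name `decode_samples` and the sample layout are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define CLOCK_MASK 0x2u /* pin 1.1 (assumed to be bit 1 of each sample) */
#define DATA_MASK  0x1u /* pin 1.0 (assumed to be bit 0 of each sample) */

/*
 * Rebuild 32-bit words from logic-analyzer samples.
 * A data bit is taken on every rising clock edge, least significant
 * bit first. Returns the number of complete words stored in `words`.
 */
size_t decode_samples(const uint8_t *samples, size_t n_samples,
                      uint32_t *words, size_t max_words)
{
    size_t n_words = 0;
    uint32_t word = 0;
    unsigned bit_count = 0;
    uint8_t prev_clock = 0; /* assumption: capture starts with clock low */

    for (size_t i = 0; i < n_samples && n_words < max_words; i++) {
        uint8_t clock = samples[i] & CLOCK_MASK;

        if (clock && !prev_clock) {               /* rising edge */
            if (samples[i] & DATA_MASK)
                word |= (uint32_t)1 << bit_count; /* LSB arrives first */
            if (++bit_count == 32) {              /* word complete */
                words[n_words++] = word;
                word = 0;
                bit_count = 0;
            }
        }
        prev_clock = clock;
    }
    return n_words;
}
```

A trailing partial word (fewer than 32 rising edges) is dropped rather than padded, so the caller only ever sees complete words.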

EDIT:

  1. UPDATE C CODE BLOCK
  2. ADD ASSEMBLY CODE BLOCK

    #include "XMC4500.h"

    void main() {
      // start dumping at memory address 0x00000000
      unsigned int* p = (uint32_t *)(0x0u);

      // configure port1 output (push-pull)
      PORT1->IOCR0 = 0x8080u;

      for(;;) {
        int i = 32;

        int data = *(p++);

        do {
          // clock low
          PORT1->OUT = 0x0;

          // clock high with data bits
          PORT1->OUT = 0x2u | data;

          data >>= 1;

        } while (--i > 0);
      }
    }

    main:
        ; PORT1->IOCR0 = 0x8080UL
        ldr r1, =0x48028100 ; load port1 base address to R1
        movw r2, #0x8080 ; move 0x8080 to R2
        str r2, [r1, #0x10]

    main_1:
        ; start copying at address 0x00000000
        ; R12 is known to be zeroed on entry
        ldr.w r2, [r12], #0x4 ; int data = *(p++)
        movs r3, #32 ; int i = 32

    main_2:
        ; PORT1->OUT = 0x0
        ; clock low
        ; R12 holds a word-aligned address here, so bits 0 and 1
        ; (data and clock pins) are still written as zero
        str r12, [r1]

        ; PORT1->OUT = 0x2 | data
        ; clock high with data bits
        orr r4, r2, #0x2
        str r4, [r1]

        asrs r2, r2, #0x1 ; data >>= 1

        subs r3, r3, #0x1 ; i--
        bne.n main_2 ; while (--i > 0)
        b.n main_1 ; while(true)

However, the code size is still too big to meet my requirements.

Is there anything I can do to further shrink down my code? Anything that can be optimized or left out?

Best Answer

I'm not used to programming ARM processors and I don't know which compiler you use, so the proposed changes may do nothing at all, but hey, let's at least try!

1-Inline your functions:

A good compiler should already inline a function when optimizations are enabled, but it is worth inlining by hand so that the call and return instructions are removed.
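
As a rough illustration of the inlining point, here is a sketch with two hypothetical helpers, `shift_out_bit` and `shift_out_word`, that are not in the original code. They are marked `static inline` so the compiler is free to merge them into the caller. `inline` is only a hint, so it is worth checking the disassembly to see whether any call actually remains. The register is passed as a pointer so the sketch can also be tried on a host with an ordinary variable.

```c
#include <stdint.h>

/* Hypothetical helper: one clock cycle carrying the lowest bit of `data`. */
static inline void shift_out_bit(volatile uint32_t *out, uint32_t data)
{
    *out = 0x0u;                  /* clock low */
    *out = 0x2u | (data & 0x1u);  /* clock high together with the data bit */
}

/* Hypothetical helper: send all 32 bits, least significant bit first. */
static inline void shift_out_word(volatile uint32_t *out, uint32_t data)
{
    for (int i = 0; i < 32; i++) {
        shift_out_bit(out, data);
        data >>= 1;
    }
}
```

On the target, the call would presumably look like `shift_out_word(&PORT1->OUT, *p++);`, assuming `PORT1->OUT` is declared as a volatile 32-bit register in the device header.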

2-Avoid branching:

On some architectures an `if` is translated into three instructions: load, test and branch. If you can get the same result without branching, the code may need fewer instructions.
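
To make the idea concrete, here is a small sketch comparing a branching and a branch-free way of copying the lowest data bit into bit 0 of an output value. The function names are hypothetical, and whether the second form really produces fewer instructions depends on the compiler and its optimization level, so treat it as something to verify in the disassembly.

```c
#include <stdint.h>

/* Branching version: test the bit, then set or clear bit 0. */
uint32_t put_bit_branching(uint32_t out, uint32_t data)
{
    if (data & 0x1u)
        out |= 0x1u;
    else
        out &= ~0x1u;
    return out;
}

/* Branch-free version: clear bit 0, then OR in the data bit. */
uint32_t put_bit_branchless(uint32_t out, uint32_t data)
{
    return (out & ~0x1u) | (data & 0x1u);
}
```

Both functions return the same value for every input; only the shape of the generated code can differ.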

So, the proposed code is:

    #include "XMC4500.h"

    int main(void)
    {
      // start dumping at memory address 0x08000000
      uint32_t *p = (uint32_t *)(0x08000000u);
      uint32_t data;
      int i;

      // configure pin 1.0 and pin 1.1 as output (push-pull)
      PORT1->IOCR0 = 0x8080UL;

      do
      {
        // fetch the next word to send
        data = *p;

        for (i = 0; i < 32; i++)
        {
          // set pin 1.1 to low (SPI clock)
          PORT1->OUT &= (~0x2UL);
          // put the current data bit on pin 1.0 (SPI MOSI)
          PORT1->OUT = (PORT1->OUT & 0xFFFE) | (data & 0x01);
          // set pin 1.1 to high (SPI clock)
          PORT1->OUT |= 0x2UL;
          data >>= 1;
        }

      } while (p++);
    }

Give it a try and post the results in a comment.