Electronic – RPi GPIO speed issue in bare metal

arm, clock-speed, embedded, gpio, raspberry-pi

After reading the excellent article on bare metal by David Welch (https://github.com/dwelch67/raspberrypi/tree/master/baremetal), a friend and I are trying to implement a simple GPIO toggle. It's built on David Welch's blinker01 (https://github.com/dwelch67/raspberrypi/tree/master/blinker01). I simply updated the peripheral registers to correspond to GPIO 3 on the Raspberry Pi. I found these addresses by looking through the BCM2835 datasheet (https://www.raspberrypi.org/app/uploads/2012/02/BCM2835-ARM-Peripherals.pdf). Chapter 6 lists all the locations of the GPIO pin settings.

We load the code onto an SD card and then power up. The program works: we get a 2.5MHz square wave on the scope.

** Why is it so slow? **

The processor has a clock speed of 1 GHz. The operation we implemented loops over 10 lines of assembly (i.e. it executes 10 lines of assembly, then branches back to the first of those 10). I'm new to bare metal/processors, and I understand that a given line of assembly can take several clock cycles. Assuming each line takes 10 clock cycles, I would still expect a 100 ns period for the output, which would be 10 MHz. I feel that is a conservative lower bound on the output frequency, since some assembly instructions take only a single cycle.

Furthermore, I found this article:
http://codeandlife.com/2012/07/03/benchmarking-raspberry-pi-gpio-speed/

This person managed to get a 22 MHz output on a Raspberry Pi 1 using a similar approach, but through Linux's mmap. The code that they used as a base is the first example on this page:
http://elinux.org/RPi_GPIO_Code_Samples

** Edit: I originally thought that the 22 MHz was achieved on a Raspberry Pi 2, but that is incorrect. They state that they achieved this output rate on a Pi 1, which has the BCM2835 chip in it. **

Note: Although this debatably belongs on StackOverflow, I felt that it's a "harder" problem in that it has to do with the circuitry in the processor and the peripherals.

Edit: Assembly code is here:
Disassembly of section .text:

00008000 <_start>:

    8000:   e3a0d902    mov sp, #32768  ; 0x8000
    8004:   eb000005    bl  8020 <notmain>

00008008 <hang>:

    8008:   eafffffe    b   8008 <hang>

0000800c <PUT32>:

    800c:   e5801000    str r1, [r0]
    8010:   e12fff1e    bx  lr

00008014 <GET32>:

    8014:   e5900000    ldr r0, [r0]
    8018:   e12fff1e    bx  lr

0000801c <dummy>:

    801c:   e12fff1e    bx  lr

00008020 <notmain>:

    8020:   e92d4010    push    {r4, lr}
    8024:   e59f002c    ldr r0, [pc, #44]   ; 8058 <notmain+0x38>
    8028:   ebfffff9    bl  8014 <GET32>
    802c:   e3c01c0e    bic r1, r0, #3584   ; 0xe00
    8030:   e3811c02    orr r1, r1, #512    ; 0x200
    8034:   e59f001c    ldr r0, [pc, #28]   ; 8058 <notmain+0x38>
    8038:   ebfffff3    bl  800c <PUT32>
    803c:   e3a01008    mov r1, #8
    8040:   e59f0014    ldr r0, [pc, #20]   ; 805c <notmain+0x3c>
    8044:   ebfffff0    bl  800c <PUT32>
    8048:   e3a01008    mov r1, #8
    804c:   e59f000c    ldr r0, [pc, #12]   ; 8060 <notmain+0x40>
    8050:   ebffffed    bl  800c <PUT32>
    8054:   eafffff8    b   803c <notmain+0x1c>
    8058:   20200000    eorcs   r0, r0, r0
    805c:   2020001c    eorcs   r0, r0, ip, lsl r0
    8060:   20200028    eorcs   r0, r0, r8, lsr #32


Best Answer

The root cause is a combination of hardware/peripheral limitations and clock configuration. I have not worked with bare metal on the BCM2835 specifically, but these are common problems in bare-metal projects on any complex architecture.

As a hint about the limitations of the GPIO output drivers, note that even when a pin is hardwired as a hardware clock output, the maximum output frequency is about 125 MHz.

From page 106 of the BCM2835 peripherals datasheet you linked:

"Operating Frequency: The maximum operating frequency of the General Purpose clocks is ~125MHz at 1.2V but this will be reduced if the GPIO pins are heavily loaded or have a capacitive load."

This is in the context of configuring the GPIO peripheral to output the peripheral clock directly without software toggling.

Even with the clocks configured properly and the CPU running at its maximum rate, it is reasonable to assume that software cannot toggle a GPIO faster than this, because of hardware limitations.

In other words, just because the peripheral latches your software command in time does not mean that the physical output driver transistors, which are large and carry a lot of inherent load, can switch as fast as your code runs. If you run tests, it is imperative to use an oscilloscope with sufficient analog bandwidth and high-quality probes, because the measurement system itself changes the result. A logic analyzer may not be sufficient, since a slow slew rate cannot be identified with thresholded inputs.


How to Proceed

If your goal is to drive a GPIO as fast as possible for clocking purposes, you should use the built-in clock output pins of the peripheral. These are configured through the registers

CM_GP0CTL and CM_GP0DIV (repeat for GP clock outputs 1 and 2)

Then, depending on the results, you can identify the maximum switching frequency for your hardware, taking into account the maximum GPIO load and the VDD of your PIO circuit.

If the clock output is slower than expected for your nominal divider settings, this indicates that you have not configured the system clock, clock routing, and PLLs appropriately.

Once you have identified such a discrepancy, you can tweak your bare-metal bootstrap code to configure the PLLs, check whether software-controlled toggling can be made to run as fast as the hardware-controlled clock output, and go from there.

Additional contributing factors may lie in the instruction and data caches, which require software configuration. If you cannot bring the software toggling in line with the hardware limit through PLL configuration alone, that is the next place I would look.