Electronic – SAMD21 1μs timer handler taking longer than the timer period

microcontroller samd21 timer

Before I begin I'd like to mention that I'm relatively new to working with microcontrollers at this low level, so please bear with me.

I am trying to use an Adafruit Trinket M0 to process and generate signals to communicate over the Nintendo GameCube communications protocol. Namely, the SAMD21 microcontroller needs to be able to output bit sequences made of 1-microsecond pulses (a 1MHz update rate). I know this is possible on 16MHz AVR microcontrollers with the existence of this library; therefore I imagine it would be possible to do so as well on the 48MHz SAMD21.

From some research online I found this code sample, based on which I made the following sketch to toggle the output pin on the Trinket M0 every 1μs:

// Physical pin labelled 3 on the Trinket M0
const uint8_t PIN = 7;

volatile uint8_t n = 0;
volatile uint8_t next_bit = 0;

void TC5_Handler(void)
{
    // Toggle pin
    PORT->Group[PORTA].OUTTGL.reg = (1 << PIN);

    // Reset timer interrupt flag
    TC5->COUNT16.INTFLAG.bit.MC0 = 1;
}

void setup()
{
    // Initialize serial
    Serial.begin(115200);

    // Configure pin as output
    PORT->Group[PORTA].PINCFG[PIN].reg = PORT_PINCFG_INEN;
    PORT->Group[PORTA].DIRSET.reg = (1ul << PIN);

    // Enable generic clock for Timer/Counter 4 and 5
    GCLK->CLKCTRL.reg = GCLK_CLKCTRL_CLKEN | GCLK_CLKCTRL_GEN_GCLK0 |
        GCLK_CLKCTRL_ID(GCM_TC4_TC5);
    while (GCLK->STATUS.bit.SYNCBUSY);

    // Perform software reset
    TC5->COUNT16.CTRLA.reg = TC_CTRLA_SWRST;
    while (TC5->COUNT16.STATUS.reg & TC_STATUS_SYNCBUSY);
    while (TC5->COUNT16.CTRLA.bit.SWRST);

    // Configure TC5
    TC5->COUNT16.CTRLA.reg =
        TC_CTRLA_MODE_COUNT16 |     // Counter of 16 bits
        TC_CTRLA_WAVEGEN_MFRQ |     // Match frequency
        TC_CTRLA_PRESCALER_DIV1;    // Prescaler of 1 (no division)
    TC5->COUNT16.CC[0].reg = F_CPU / 1e6 - 1; // Trigger @ 1MHz=10^6Hz
    while (TC5->COUNT16.STATUS.bit.SYNCBUSY);

    // Configure interrupt
    uint32_t oldISER = NVIC->ISER[0];
    NVIC->ICER[0] |= ~0ul;  // Disable all interrupts
    NVIC->ICPR[0] = 1ul << TC5_IRQn;            // Clear pending timer interrupt
    NVIC->ISER[0] = 1ul << (TC5_IRQn & 0x1f);   // Enable TC5 interrupt
    TC5->COUNT16.INTENSET.bit.MC0 = 1;          // Match/Compare 0 interrupt
    while (TC5->COUNT16.STATUS.bit.SYNCBUSY);

    // Start counter
    TC5->COUNT16.CTRLA.reg |= TC_CTRLA_ENABLE;  // Enable TC5
    while (TC5->COUNT16.STATUS.bit.SYNCBUSY);
}

void loop()
{
}

This works great. My logic analyzer shows me perfect 1μs pulses.

However, adding a small conditional statement to output "0" bit sequences according to the GameCube protocol stretches each pulse to closer to 2μs.

void TC5_Handler(void)
{
    // Toggle pin
    (next_bit ?
        PORT->Group[PORTA].OUTSET.reg :
        PORT->Group[PORTA].OUTCLR.reg) = (1 << PIN);
    next_bit = (n & 3) < 3 ? 0 : 1;
    ++n;

    // Reset timer interrupt flag
    TC5->COUNT16.INTFLAG.bit.MC0 = 1;
}

I didn't expect that little added code to suddenly make the function take that much longer than a microsecond. In that case, how can I make this task more efficient? Must I pursue a different route with generating these signals?

Best Answer

You only have 48 clock cycles to handle each 1us interrupt. I kind of agree with you that it seems it should be possible to get such a simple interrupt routine to run at close to a 1us interval at 48MHz. However, you probably have other interrupts in the system taking CPU time, and as it stands your interrupt routine would need roughly 100% of the CPU to run at 1us intervals. Any additional ISRs would make this impossible to achieve.

I wouldn't go too deep into analysing why it is taking close to 100 cycles to handle the interrupt or trying to optimise it into assembly, because bit-banging here is the wrong approach. The MCU is spending almost 100% of its time doing stuff that peripherals were created to do for you.

You can use the SAMD21's SERCOM USART. Set it to 6-bit mode with 1 stop bit, and run it at 1Mbaud so that each UART bit lasts 1us. The Nintendo protocol you referenced represents one bit using four 1us pulses. The first pulse is always low and the last pulse is always high. The USART's start bit, which is always low, will represent the first pulse of bit one. The USART's stop bit, which is always high, will represent the last pulse of bit two. Each time you send a 6-bit sequence via the USART you cover 8 pulses, representing 2 bits of information.

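The 6 bits you write to the USART data register will represent the last 3 pulses of bit one and the first 3 pulses of bit two. So a 01 would go out on the wire as 001011 (written here in transmission order). UARTs conventionally shift the least significant bit out first, so double-check the bit order (endianness) and reverse the value you write if necessary.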
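As a rough sketch of that packing, assuming the pulse patterns described above (a 0 bit is low-low-low-high, a 1 bit is low-high-high-high): the function names `gc_pair_wire_order` and `gc_pair_lsb_first` are made up for illustration and are not part of any library. The first returns the 6 data bits in the order they should appear on the wire; the second mirrors them for a UART that shifts the least significant bit out first.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: the 6 data bits of one frame, in transmission order
   (the first pulse on the wire is bit 5 of the returned value).
   Layout: [b1 b1 1] = last 3 pulses of the first protocol bit,
           [0 b2 b2] = first 3 pulses of the second protocol bit.
   The start bit (low) and stop bit (high) supply the two missing pulses. */
uint8_t gc_pair_wire_order(uint8_t first_bit, uint8_t second_bit)
{
    assert(first_bit <= 1u && second_bit <= 1u);
    return (uint8_t)((first_bit << 5) | (first_bit << 4) | (1u << 3) |
                     (second_bit << 1) | second_bit);
}

/* Hypothetical helper: the value to write if the UART sends the least
   significant bit first, i.e. the wire-order pattern mirrored over 6 bits. */
uint8_t gc_pair_lsb_first(uint8_t first_bit, uint8_t second_bit)
{
    uint8_t wire = gc_pair_wire_order(first_bit, second_bit);
    uint8_t value = 0;
    for (int i = 0; i < 6; ++i) {
        value |= (uint8_t)(((wire >> (5 - i)) & 1u) << i);
    }
    return value;
}
```

For the 01 pair, `gc_pair_wire_order(0, 1)` gives 0b001011 and `gc_pair_lsb_first(0, 1)` gives 0b110100. Which of the two you actually write depends on how the USART's data order is configured, so check it against the datasheet and a logic analyzer capture.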

Now you only need a USART interrupt once every 8us instead of every 1us, giving you 8 x 48 = 384 cycles to handle each interrupt instead of 48.

If you need even more performance, you could set up DMA to feed the USART with data, further reducing the CPU interrupt load, but this probably isn't needed for your application.

EDIT - ISR optimisation

To improve ISR performance, as a general rule, minimise how much is done in the ISR.

Taking your existing ISR as an example: consider pre-computing the sequence of set/clear operations outside the ISR (inside main()) and storing the sequence of actions in a buffer. This moves the decision about what action to take for each pulse out of the ISR, since you only have 2 sequences to generate: 1 sequence for a 0 bit and 1 sequence for a 1 bit. If you are pre-calculating, you don't need a bunch of conditionals for each pulse, only one conditional per bit. For example, in pseudo-code:

enum EAction
{
  CLEAR,
  SET
};

// somewhere inside main()
// check for buffer bounds somewhere before continuing, action_buffer size must be a multiple of 4
if(bit_high)
{
  action_buffer[i++] = CLEAR;
  action_buffer[i++] = SET;
  action_buffer[i++] = SET;
  action_buffer[i++] = SET;
}
else
{
  action_buffer[i++] = CLEAR;
  action_buffer[i++] = CLEAR;
  action_buffer[i++] = CLEAR;
  action_buffer[i++] = SET;
}

// inside ISR
// add buffer bounds check
if(action_buffer[j++] == CLEAR)
{
   PORT->Group[PORTA].OUTCLR.reg = (1 << PIN);
}
else
{
   PORT->Group[PORTA].OUTSET.reg = (1 << PIN);
}
...

Your last example of void TC5_Handler(void) only generates the sequence for a 0 bit value (three low pulses followed by one high pulse), so I am guessing you have not posted the code that has the additional conditionals for generating a 1 bit value. The code you posted already had 2 conditional statements and probably had 1 or 2 more in the full implementation. With precomputed values you will only need 2 conditionals inside the ISR: the 1 shown above, plus bounds checking on the buffer. You then have another couple of conditionals in the main() loop, but these are 2 per 4 pulses instead of 2 per 1 pulse.

The challenge with this approach is managing the buffer so that it is "thread-safe": you don't want race conditions caused by main() and the ISR accessing the buffer at the same time. One approach is to do transactions in bursts: main() fills the buffer while the ISR does nothing, then once the buffer is filled main() kicks off the ISR, which runs until the buffer is empty.
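The burst approach could look roughly like the sketch below. It is written as plain C so it can be tried on a PC; the names `burst_load`, `burst_next_level`, `burst_active` and the buffer size of 64 are all assumptions made for illustration. On the real part, `burst_next_level()` would be called from the timer ISR and its result used to pick OUTSET or OUTCLR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ACTION_BUFFER_SIZE 64u /* example size; must be a multiple of 4 */

static volatile uint8_t action_buffer[ACTION_BUFFER_SIZE];
static volatile size_t pulse_count = 0;    /* valid entries in the buffer */
static volatile size_t pulse_index = 0;    /* next entry the ISR will play */
static volatile bool burst_active = false; /* true while the ISR owns the buffer */

/* Main context: fill the buffer, then hand it over to the ISR.
   Returns false if a burst is still running or the data does not fit. */
bool burst_load(const uint8_t *bits, size_t bit_count)
{
    assert(bits != NULL);
    if (burst_active || bit_count == 0 || bit_count > ACTION_BUFFER_SIZE / 4u) {
        return false;
    }
    size_t i = 0;
    for (size_t b = 0; b < bit_count; ++b) {
        uint8_t middle = bits[b] ? 1u : 0u;
        action_buffer[i++] = 0u;     /* every bit starts low */
        action_buffer[i++] = middle; /* the two middle pulses carry the value */
        action_buffer[i++] = middle;
        action_buffer[i++] = 1u;     /* every bit ends high */
    }
    pulse_count = i;
    pulse_index = 0;
    burst_active = true; /* written last, so the ISR never sees a half-filled buffer */
    return true;
}

/* ISR context: returns the level for this pulse (0 or 1), or -1 when idle. */
int burst_next_level(void)
{
    if (!burst_active) {
        return -1;
    }
    size_t idx = pulse_index;
    int level = action_buffer[idx];
    pulse_index = idx + 1u;
    if (idx + 1u >= pulse_count) {
        burst_active = false; /* burst finished: hand the buffer back to main() */
    }
    return level;
}
```

Because main() only writes the buffer while `burst_active` is false and the ISR only reads it while the flag is true, the two never touch the buffer at the same time. The flag itself is a single byte-sized write, which keeps the handover simple on a single-core part.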

Note that technically the pin doesn't need to be set or cleared multiple times, but adding a third action to do nothing would not improve behaviour and would only add an extra conditional check inside the ISR. The code above therefore repeats the set or clear each time.

You can also improve ISR performance by pre-computing the register addresses to be accessed inside the ISR. PORT->Group[PORTA].OUTCLR.reg has 2 levels of indirection, which means the ISR may have to do a fair amount of pointer arithmetic to compute the address. If PORT is a pointer constant (* const), a compiler with optimisations turned on might be able to hard-code the address and avoid the indirection for you. Alternatively, you could hard-code the register address yourself.
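A minimal sketch of caching the addresses, written so it compiles on a PC: `FakePort`, `fake_port`, `out_clr`, `out_set` and `drive_pin` are stand-ins invented for this example and only imitate the nesting of the real register structs. On the target you would initialise the two pointers from &PORT->Group[PORTA].OUTCLR.reg and &PORT->Group[PORTA].OUTSET.reg instead.

```c
#include <assert.h>
#include <stdint.h>

/* Host-side stand-ins that copy the nesting of PORT->Group[n].OUTSET.reg. */
typedef struct { volatile uint32_t reg; } FakeReg;
typedef struct { FakeReg OUTCLR; FakeReg OUTSET; } FakeGroup;
typedef struct { FakeGroup Group[2]; } FakePort;

static FakePort fake_port;
#define FAKE_PORTA 0
#define PIN 7u

/* Resolved once, outside the ISR, so the handler only dereferences a
   ready-made pointer instead of walking the struct every time. */
static volatile uint32_t *const out_clr = &fake_port.Group[FAKE_PORTA].OUTCLR.reg;
static volatile uint32_t *const out_set = &fake_port.Group[FAKE_PORTA].OUTSET.reg;
static const uint32_t pin_mask = 1ul << PIN;

/* What the body of the ISR would do for one pulse. */
void drive_pin(int level)
{
    assert(level == 0 || level == 1);
    *(level ? out_set : out_clr) = pin_mask;
}
```

Whether this is any faster than the original expression depends on what the compiler already does with it, so compare the generated assembly or measure with the logic analyzer before keeping it.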

Bear in mind, though, that optimising for performance often sacrifices the readability and flexibility of the code, and it also takes more time. I would avoid hard-coding register addresses unless you really have to for performance. You might find that just switching on compiler optimisations deals with most of these sorts of things for you.
