Electronic – AVR GCC: How to improve code optimization

attiny avr c gcc optimization

I tried to compile the following C code:

period = TCNT0L;
period |= ((unsigned int)TCNT0H<<8);

The assembler code I'm getting is the following:

    period = TCNT0L;
  d2:   22 b7           in  r18, 0x32   ; 50
  d4:   30 e0           ldi r19, 0x00   ; 0
  d6:   30 93 87 00     sts 0x0087, r19
  da:   20 93 86 00     sts 0x0086, r18
    period |= ((unsigned int)TCNT0H<<8);
  de:   44 b3           in  r20, 0x14   ; 20
  e0:   94 2f           mov r25, r20
  e2:   80 e0           ldi r24, 0x00   ; 0
  e4:   82 2b           or  r24, r18
  e6:   93 2b           or  r25, r19
  e8:   90 93 87 00     sts 0x0087, r25
  ec:   80 93 86 00     sts 0x0086, r24

So instead of the 4 instructions I expected (two reads and two stores), the compiler emits 11!

I tried the O1, O2, O3 and Os optimization options. The result is the same (except that O3 optimized this code away entirely).

I could write the source code in the following way:

period = TCNT0L | ((unsigned int)TCNT0H<<8);

This gives smaller, but still not optimal, code:

  de:   22 b7           in  r18, 0x32   ; 50
  e0:   34 b3           in  r19, 0x14   ; 20
  e2:   93 2f           mov r25, r19
  e4:   80 e0           ldi r24, 0x00   ; 0
  e6:   82 2b           or  r24, r18
  e8:   90 93 87 00     sts 0x0087, r25
  ec:   80 93 86 00     sts 0x0086, r24

However, I no longer have a guarantee that the lower byte will be accessed first (this is an essential requirement for a correct 16-bit read). And the code still contains several unnecessary instructions.

Can I change the compiler options and/or the source code to make it better? I'd prefer to avoid resorting to assembler.
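The ordering concern on its own can be addressed in plain C by reading each register in a separate statement, since two volatile reads in separate statements happen in program order. The sketch below is only an illustration: `FAKE_TCNT0L`, `FAKE_TCNT0H` and `read_timer_low_first` are hypothetical names, with ordinary volatile variables standing in for the real registers so the code builds off-target. It says nothing about how tight the code generated by avr-gcc will be.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the timer registers so the sketch builds on a PC.
   On the target these would be TCNT0L and TCNT0H from <avr/io.h>. */
volatile uint8_t FAKE_TCNT0L;
volatile uint8_t FAKE_TCNT0H;

/* Read the 16-bit counter, low byte first. */
uint16_t read_timer_low_first(void)
{
    uint8_t lo = FAKE_TCNT0L;   /* first volatile read: low byte */
    uint8_t hi = FAKE_TCNT0H;   /* second volatile read: high byte */
    return (uint16_t)(((uint16_t)hi << 8) | lo);
}
```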

UPDATE1:

I tried the code @caveman suggested:

((unsigned char*)(&period))[0] = TCNT0L;
((unsigned char*)(&period))[1] = TCNT0H;

But the result is also not very good:

    ((unsigned char*)(&period))[0] = TCNT0L;
  dc:   82 b7           in  r24, 0x32   ; 50
  de:   e6 e8           ldi r30, 0x86   ; 134
  e0:   f0 e0           ldi r31, 0x00   ; 0
  e2:   80 83           st  Z, r24
    ((unsigned char*)(&period))[1] = TCNT0H;
  e4:   84 b3           in  r24, 0x14   ; 20
  e6:   81 83           std Z+1, r24    ; 0x01

Best Answer

One method is to store directly into the two halves of period. While this looks complicated in C, it will usually generate very tight assembly, i.e. 2 loads and 2 stores.

((uint8_t*)(&period))[0] = TCNT0L;
((uint8_t*)(&period))[1] = TCNT0H;
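As a self-contained sketch of the same idea, the two stores can be wrapped in a small helper. The names `FAKE_TCNT0L`, `FAKE_TCNT0H` and `capture_period` are hypothetical, and plain volatile variables stand in for the registers so the code compiles off-target. Note that treating byte 0 as the low byte assumes a little-endian layout, which is what avr-gcc uses.

```c
#include <stdint.h>

/* Hypothetical stand-ins for TCNT0L / TCNT0H so the sketch builds off-target. */
volatile uint8_t FAKE_TCNT0L;
volatile uint8_t FAKE_TCNT0H;

uint16_t period;

/* Store each register directly into one byte of `period`.
   Byte 0 is the low byte only on a little-endian target such as AVR. */
void capture_period(void)
{
    uint8_t *bytes = (uint8_t *)&period;
    bytes[0] = FAKE_TCNT0L;   /* low register is read and stored first */
    bytes[1] = FAKE_TCNT0H;   /* then the high register */
}
```

Because the two accesses are separate statements on volatile objects, the low byte is read before the high byte.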

Sometimes the array indexing can cause issues, so you could try pointer arithmetic instead:

*((uint8_t*)(&period)) = TCNT0L;
*((uint8_t*)(&period) + 1) = TCNT0H;

This actually produces code that is optimal in size. Note that only 12 bytes are used:

    ((unsigned char*)(&period))[0] = TCNT0L;
  dc:   82 b7           in  r24, 0x32   ; 50
  de:   e6 e8           ldi r30, 0x86   ; 134
  e0:   f0 e0           ldi r31, 0x00   ; 0
  e2:   80 83           st  Z, r24
    ((unsigned char*)(&period))[1] = TCNT0H;
  e4:   84 b3           in  r24, 0x14   ; 20
  e6:   81 83           std Z+1, r24    ; 0x01

If you wrote this by hand in assembly, it would probably seem more natural to do it like this. It is also 12 bytes, so the two are equivalent.

  dc:   82 b7           in  r24, 0x32   ; 50
  de:   80 93 86 00     sts 0x0086, r24
  e2:   84 b3           in  r24, 0x14   ; 20
  e4:   80 93 87 00     sts 0x0087, r24

Of course, when I say "equivalent", I mean regarding code size. If time is more important, then you have to look at the cycles. In this case it looks like the assembly version is 6 cycles and the compiler's version is 8 cycles.
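The size and cycle comparison can be tallied with a few lines of C. This is only a bookkeeping sketch: the byte sizes are read off the opcode columns of the two listings, while the cycle counts (1 for `in` and `ldi`, 2 for `st`, `std` and `sts`) are assumed values for a classic AVR core and should be checked against the instruction set manual for the actual device. The struct and function names are made up for illustration.

```c
#include <stddef.h>

/* One instruction: size in bytes and execution time in cycles.
   Cycle counts are assumptions for a classic AVR core. */
struct insn { const char *name; unsigned bytes; unsigned cycles; };

/* Compiler output: in, ldi, ldi, st Z, in, std Z+1 */
const struct insn compiler_version[] = {
    {"in", 2, 1}, {"ldi", 2, 1}, {"ldi", 2, 1},
    {"st", 2, 2}, {"in", 2, 1}, {"std", 2, 2},
};

/* Hand-written version: in, sts, in, sts */
const struct insn handwritten_version[] = {
    {"in", 2, 1}, {"sts", 4, 2}, {"in", 2, 1}, {"sts", 4, 2},
};

/* Sum the encoded size of a sequence of n instructions. */
unsigned total_bytes(const struct insn *seq, size_t n)
{
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += seq[i].bytes;
    return sum;
}

/* Sum the cycle count of a sequence of n instructions. */
unsigned total_cycles(const struct insn *seq, size_t n)
{
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += seq[i].cycles;
    return sum;
}
```

Under these assumptions both sequences total 12 bytes, while the cycle totals are 8 for the compiler's version and 6 for the hand-written one, matching the figures above.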