I tried to compile the following C code:
period = TCNT0L;
period |= ((unsigned int)TCNT0H<<8);
The assembler code I'm getting is the following:
period = TCNT0L;
d2: 22 b7 in r18, 0x32 ; 50
d4: 30 e0 ldi r19, 0x00 ; 0
d6: 30 93 87 00 sts 0x0087, r19
da: 20 93 86 00 sts 0x0086, r18
period |= ((unsigned int)TCNT0H<<8);
de: 44 b3 in r20, 0x14 ; 20
e0: 94 2f mov r25, r20
e2: 80 e0 ldi r24, 0x00 ; 0
e4: 82 2b or r24, r18
e6: 93 2b or r25, r19
e8: 90 93 87 00 sts 0x0087, r25
ec: 80 93 86 00 sts 0x0086, r24
So instead of 4 instructions I get 11!
I tried the O1, O2, O3 and Os optimization options. The result is the same (except that O3
optimized this code away entirely).
I could write the source code in the following way:
period = TCNT0L | ((unsigned int)TCNT0H<<8);
That gives smaller, but still not optimal, code:
de: 22 b7 in r18, 0x32 ; 50
e0: 34 b3 in r19, 0x14 ; 20
e2: 93 2f mov r25, r19
e4: 80 e0 ldi r24, 0x00 ; 0
e6: 82 2b or r24, r18
e8: 90 93 87 00 sts 0x0087, r25
ec: 80 93 86 00 sts 0x0086, r24
However, I no longer have a guarantee that the lower byte will be accessed first (this is an essential requirement for a correct 16-bit read). And the code still has several unnecessary instructions.
Can I change the compiler options and/or the source code to improve this? I'd rather not resort to assembler.
UPDATE1:
I tried the code @caveman suggested:
((unsigned char*)(&period))[0] = TCNT0L;
((unsigned char*)(&period))[1] = TCNT0H;
But the result is also not very good:
((unsigned char*)(&period))[0] = TCNT0L;
dc: 82 b7 in r24, 0x32 ; 50
de: e6 e8 ldi r30, 0x86 ; 134
e0: f0 e0 ldi r31, 0x00 ; 0
e2: 80 83 st Z, r24
((unsigned char*)(&period))[1] = TCNT0H;
e4: 84 b3 in r24, 0x14 ; 20
e6: 81 83 std Z+1, r24 ; 0x01
Best Answer
One method is to write each byte directly into its half of period. While this looks complicated in C, it will usually generate very tight assembly, i.e. 2 loads and 2 stores.
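A minimal sketch of that approach, using the same two statements quoted in UPDATE1 of the question. The helper name read_timer0_bytes is made up for illustration, and TCNT0L/TCNT0H are defined here as plain volatile variables only so the sketch builds on a host; on the AVR they would come from <avr/io.h> instead.

```c
#include <stdint.h>

/* Host stand-ins (assumption): on the AVR these are the I/O registers
   declared by <avr/io.h>, and these two definitions must be dropped. */
volatile uint8_t TCNT0L, TCNT0H;

uint16_t period;

/* Hypothetical helper: copy the counter into period one byte at a time. */
void read_timer0_bytes(void)
{
    /* The registers are volatile, so the two reads happen in statement
       order: low byte first, then high byte. */
    ((unsigned char *)&period)[0] = TCNT0L;  /* byte 0: low byte on little-endian AVR */
    ((unsigned char *)&period)[1] = TCNT0H;  /* byte 1: high byte */
}
```

Because each register read is its own statement, the low-byte-first order that the 16-bit read depends on is preserved.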
Sometimes using the array math can cause issues, so you could try this:
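One way to avoid the indexing, offered here as an assumption rather than as the exact code meant above, is a union that names the two bytes. The type timer_word_t and the helper read_timer0_union are hypothetical names, and TCNT0L/TCNT0H are again host stand-ins for the registers from <avr/io.h>.

```c
#include <stdint.h>

/* Host stand-ins (assumption): replace with <avr/io.h> on the target. */
volatile uint8_t TCNT0L, TCNT0H;

/* Hypothetical type: a 16-bit word overlaid with its two bytes.
   The lo/hi order matches the little-endian layout used by avr-gcc. */
typedef union {
    uint16_t word;
    struct {
        uint8_t lo;
        uint8_t hi;
    } bytes;
} timer_word_t;

timer_word_t period;

/* Hypothetical helper: fill each half of period with a plain byte store. */
void read_timer0_union(void)
{
    period.bytes.lo = TCNT0L;  /* volatile read of the low byte first */
    period.bytes.hi = TCNT0H;  /* then the high byte */
}
```

The combined value is then available as period.word. Whether this compiles to tighter code than the indexed form depends on the compiler version, so it is worth checking the listing.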
This actually produces code of optimal size: the generated listing is only 12 bytes long.
If you wrote this by hand in assembly, the natural choice would probably be an in/sts pair for each byte. Because each sts is a 4-byte instruction, that version is also 12 bytes, so the two are equivalent.
Of course, by "equivalent" I mean in code size. If execution time matters more, you have to count cycles: in this case the assembly version appears to take 6 cycles and the compiler's version 8 cycles.
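The byte and cycle totals can be checked with a small host-side tally. The instruction sequences are assumptions: the hand-written version is taken to be in/sts/in/sts, and the compiler's version is taken to be the in/ldi/ldi/st/in/std listing from UPDATE1. Sizes and cycle counts are those of a classic AVR core, and the names insn_cost_t, total_bytes and total_cycles are made up for illustration.

```c
#include <stddef.h>

/* Hypothetical record: size and duration of one AVR instruction. */
typedef struct {
    const char *mnemonic;
    unsigned bytes;
    unsigned cycles;
} insn_cost_t;

/* Assumed hand-written sequence: one in/sts pair per byte. */
static const insn_cost_t hand_written[] = {
    {"in", 2, 1}, {"sts", 4, 2},
    {"in", 2, 1}, {"sts", 4, 2},
};

/* Assumed compiler sequence: set up Z once, then store through it. */
static const insn_cost_t compiler_output[] = {
    {"in", 2, 1}, {"ldi", 2, 1}, {"ldi", 2, 1}, {"st", 2, 2},
    {"in", 2, 1}, {"std", 2, 2},
};

/* Sum the encoded size of a sequence of n instructions. */
unsigned total_bytes(const insn_cost_t *seq, size_t n)
{
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += seq[i].bytes;
    return sum;
}

/* Sum the execution time of a sequence of n instructions. */
unsigned total_cycles(const insn_cost_t *seq, size_t n)
{
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += seq[i].cycles;
    return sum;
}
```

Under these assumptions both sequences come to 12 bytes, while the cycle totals are 6 and 8; the two extra cycles are the ldi pair that loads the Z pointer.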