Electrical – Code Optimization of IF Statements

arduinoccodeoptimization

Considering the design limitation of hardware (clock speed, memory, etc…), is there any better way to write IF statements more efficiently in C language (e.g. run on Arduino), more specifically the following piece of code of an embedded software?

if (A >= 476 && A <= 524) A = 500;
else if (A >  524 && A <= 860) A = (A * 1.4) - 760;
else if (A >  136 && A <  476) A = (A * 1.4) - 685;
else if (A > 860) A = 1000;
else if (A <= 136) A = 100;

A is passed as value by a sub-routine and this value is then transformed into a different value according to its position in a set range (of values). It has to run as fast as possible!

UPDATE!

A is a variable of integer type i.e. int

SUMMARY:
I summarized here the options I learned from you:

a) improve structure

consider first comparison for most probable inputs;
eliminate && operators.

b) eliminate floating point calculation
c) create look up tables
d) assign more adequate variable type
e) write code in abbreviated form (*= , += , -= )

a) Indeed, I already put the comparison 'if (A >= 476 && A <= 524)' first, because I expect most of the values passed by the subroutine to be within the first range. So most of the time the program doesn't execute the next if statements.

b) I reviewed the floating point calculation and consider it a bad choice for this piece of code. It definitely can be avoided by using a divider of integers.

c) The lookup table seems a good idea but not for this application, because the program has at least 3 more of such clustered comparisons. Memory might become a problem.

d) The variable was assigned as int although it would never really become negative. Unsigned int is what I assume is better. But I would say it doesn't make the code faster.

e) Writing the code in abbreviated form seems also another improvement idea. I just don't know for sure how different compilers might interpret that or if it really saves so much time.

Thank you.

Best Answer

The floating point is the killer here. It appears to be doubling the amount of code generated, not that number of instructions is necessarily slower.

It has to make a number of calls to what I assume is a float to int int to float. Interestingly changing from double float (1.4) to single float (1.4F) didnt change the overall result, although again the code is making a lot of calls (I assume to float conversions or operations).

So even the times 7 has to be synthesized.

unsigned int fun ( unsigned int A )
{
A = A*7;
return(A);
}

00000000 <fun>:
   0:   28 2f           mov r18, r24
   2:   39 2f           mov r19, r25
   4:   22 0f           add r18, r18
   6:   33 1f           adc r19, r19
   8:   22 0f           add r18, r18
   a:   33 1f           adc r19, r19
   c:   22 0f           add r18, r18
   e:   33 1f           adc r19, r19
  10:   42 2f           mov r20, r18
  12:   53 2f           mov r21, r19
  14:   48 1b           sub r20, r24
  16:   59 0b           sbc r21, r25
  18:   84 2f           mov r24, r20
  1a:   95 2f           mov r25, r21
  1c:   08 95           ret

so one might try this as an alternate.

unsigned int fun ( unsigned int A )
{
A = (A<<2)+(A<<1)+A;
return(A);
}

00000000 <fun>:
   0:   28 2f           mov r18, r24
   2:   39 2f           mov r19, r25
   4:   22 0f           add r18, r18
   6:   33 1f           adc r19, r19
   8:   22 0f           add r18, r18
   a:   33 1f           adc r19, r19
   c:   48 2f           mov r20, r24
   e:   59 2f           mov r21, r25
  10:   44 0f           add r20, r20
  12:   55 1f           adc r21, r21
  14:   24 0f           add r18, r20
  16:   35 1f           adc r19, r21
  18:   82 0f           add r24, r18
  1a:   93 1f           adc r25, r19
  1c:   08 95           ret

same number of instructions. to see this easier

unsigned char fun ( unsigned char A )
{
A = A * 7;
return(A);
}

00000000 <fun>:
   0:   98 2f           mov r25, r24
   2:   99 0f           add r25, r25
   4:   99 0f           add r25, r25
   6:   99 0f           add r25, r25
   8:   98 1b           sub r25, r24
   a:   89 2f           mov r24, r25
   c:   08 95           ret

and using shifting and adding

00000000 <fun>:
   0:   98 2f           mov r25, r24
   2:   99 0f           add r25, r25
   4:   99 0f           add r25, r25
   6:   28 2f           mov r18, r24
   8:   22 0f           add r18, r18
   a:   92 0f           add r25, r18
   c:   89 0f           add r24, r25
   e:   08 95           ret

Duh, right A*7 = (A<<3)-1;

00000000 <fun>:
   0:   98 2f           mov r25, r24  r25 = A
   2:   99 0f           add r25, r25  r25 = A + A = 2A
   4:   99 0f           add r25, r25  r25 = 2A + 2A = 4A
   6:   99 0f           add r25, r25  r25 = 4A + 4A = 8A
   8:   98 1b           sub r25, r24  r25 = 8A - A = 7A
   a:   89 2f           mov r24, r25  return 7A
   c:   08 95           ret

The (A<<2)|(A<<1)+A source gives this

00000000 <fun>:
   0:   98 2f           mov r25, r24  r25 = A
   2:   99 0f           add r25, r25  r25 = A + A = 2A
   4:   99 0f           add r25, r25  r25 = 2A + 2A = 4A
   6:   28 2f           mov r18, r24  r18 = A
   8:   22 0f           add r18, r18  r18 = A + A = 2A
   a:   92 0f           add r25, r18  r15 = 4A + 2A = 6A
   c:   89 0f           add r24, r25  return (6A + A)
   e:   08 95           ret

But they missed a possible optimization

   mov  r25, r24  r25 = A
   add  r25, r25  r25 = A + A = 2A
   mov  r18, r25  r18 = 2A
   add  r25, r25  r25 = 2A + 2A = 4A
   add  r25, r18  r15 = 4A + 2A = 6A
   add  r24, r25  return (6A + A)
   ret

same number of instructions at least, ideally execute the same. Looking at these trip points

if ( A <= 136 ) A= 100;
else if ( A < 476 ) A= (A * 1.4) - 685;
else if ( A <= 524 ) A= 500;
else if ( A <= 860 ) A= (A * 1.4) - 760;
else A= 1000;

A is obviously not an 8 bit number is at least 16 bit, if it is a double or a float I give up, game over. Assuming 16 bits...

Here are the constants in hex.

136 0x88
476 0x1dc
524 0x20C
860 0x35C

Looking at this

unsigned int fun ( unsigned int A )
{
if ( A <= 136 ) A= 1;
else if ( A < 476 ) A=2;
else if ( A <= 524 ) A=3;
else A= 4;
return(A);
}

00000000 <fun>:
   0:   89 38           cpi r24, 0x89   ; 137
   2:   91 05           cpc r25, r1
   4:   00 f0           brcs    .+0         ; 0x6 <fun+0x6>
   6:   8c 3d           cpi r24, 0xDC   ; 220
   8:   21 e0           ldi r18, 0x01   ; 1
   a:   92 07           cpc r25, r18
   c:   00 f4           brcc    .+0         ; 0xe <fun+0xe>
   e:   82 e0           ldi r24, 0x02   ; 2
  10:   90 e0           ldi r25, 0x00   ; 0
  12:   08 95           ret
  14:   8d 30           cpi r24, 0x0D   ; 13
  16:   92 40           sbci    r25, 0x02   ; 2
  18:   00 f4           brcc    .+0         ; 0x1a <fun+0x1a>
  1a:   83 e0           ldi r24, 0x03   ; 3
  1c:   90 e0           ldi r25, 0x00   ; 0
  1e:   08 95           ret
  20:   81 e0           ldi r24, 0x01   ; 1
  22:   90 e0           ldi r25, 0x00   ; 0
  24:   08 95           ret
  26:   84 e0           ldi r24, 0x04   ; 4
  28:   90 e0           ldi r25, 0x00   ; 0
  2a:   08 95           ret

does this help

    unsigned int fun ( unsigned int A )
    {
        unsigned char B = A>>2;
        if ( B <= (136>>2) ) A= 1;
        else if ( B < (476>>2) ) A=2;
        else if ( B <= (524>>2) ) A=3;
        else A= 4;
        return(A);
    }

00000000 <fun>:
   0:   96 95           lsr r25
   2:   87 95           ror r24
   4:   96 95           lsr r25
   6:   87 95           ror r24
   8:   83 32           cpi r24, 0x23   ; 35
   a:   00 f0           brcs    .+0         ; 0xc <fun+0xc>
   c:   87 37           cpi r24, 0x77   ; 119
   e:   00 f4           brcc    .+0         ; 0x10 <fun+0x10>
  10:   82 e0           ldi r24, 0x02   ; 2
  12:   90 e0           ldi r25, 0x00   ; 0
  14:   08 95           ret
  16:   84 38           cpi r24, 0x84   ; 132
  18:   00 f4           brcc    .+0         ; 0x1a <fun+0x1a>
  1a:   83 e0           ldi r24, 0x03   ; 3
  1c:   90 e0           ldi r25, 0x00   ; 0
  1e:   08 95           ret
  20:   81 e0           ldi r24, 0x01   ; 1
  22:   90 e0           ldi r25, 0x00   ; 0
  24:   08 95           ret
  26:   84 e0           ldi r24, 0x04   ; 4
  28:   90 e0           ldi r25, 0x00   ; 0
  2a:   08 95           ret

ehh, man this instruction set is painful.

so looking at this

unsigned int fun ( unsigned int A )
{
if ( A <= 136 ) A= 100;
else if ( A < 476 ) A= (A * 7);
else if ( A <= 524 ) A= 500;
else if ( A <= 860 ) A= (A * 7);
else A= 1000;
return(A);
}

the compiler is smart enough to lump the A*7's together.

Like this though

unsigned int fun ( unsigned int A )
{
if ( A <= 136 ) A= 100;
else if ( A < 476 ) A= (A * 7) - 685;
else if ( A <= 524 ) A= 500;
else if ( A <= 860 ) A= (A * 7) - 760;
else A= 1000;
return(A);
}

It does not combine the A*7s note the 16 bit A*7 looks like this

  30:   28 2f           mov r18, r24
  32:   39 2f           mov r19, r25
  34:   22 0f           add r18, r18
  36:   33 1f           adc r19, r19
  38:   22 0f           add r18, r18
  3a:   33 1f           adc r19, r19
  3c:   22 0f           add r18, r18
  3e:   33 1f           adc r19, r19
  40:   42 2f           mov r20, r18
  42:   53 2f           mov r21, r19
  44:   48 1b           sub r20, r24
  46:   59 0b           sbc r21, r25
  48:   84 2f           mov r24, r20
  4a:   95 2f           mov r25, r21
  4c:   08 95           ret

for ease of reading and coding the comparisons in order

if ( A > 860 ) A= 1000;
else if ( A > 524 ) A= (A * 1.4) - 760;
else if ( A >= 476 ) A= 500;
else if ( A > 136 ) A= (A * 1.4) - 685;
else A= 100;

if ( A <= 136 ) A= 100;
else if ( A < 476 ) A= (A * 1.4) - 685;
else if ( A <= 524 ) A= 500;
else if ( A <= 860 ) A= (A * 1.4) - 760;
else A= 1000;

Might shave some instructions for the worst case path, but if your data happens to fall in a bell curve of some sort, most of the samples are say between 500 and 600, then you want to have those ranges compared first then let the others fall through. it may or may not be worth trying to get the ranges that end with A*1.4 as the last two, so that before those compares you do a single B = A * 1.4. you do the compare of one of the ranges then either subtract A = B - 685 or A = B - 760, saving some codes space, not necessarily saving on performance because with an if-then-else tree like this re-ordering, etc is not going to save you much, some paths are shorter some are longer, you may add a few or save a few instructions here and there, but it is still a pretty long path for this code on this architecture.

I didnt do a full link with a library, but looking at readelf:

11: 00000000     0 NOTYPE  GLOBAL DEFAULT  UND __floatunsisf
12: 00000000     0 NOTYPE  GLOBAL DEFAULT  UND __mulsf3
13: 00000000     0 NOTYPE  GLOBAL DEFAULT  UND __subsf3
14: 00000000     0 NOTYPE  GLOBAL DEFAULT  UND __fixunssfsi

Those floating point libraries are going to kill you and that answers something, even though the C standard says that A * 1.4 is a double operation the compiler is clearly doing single float, ignoring the standard.

Changing the A * 1.4 to A * 7 / 5 implements the long string of adds to implement the times 7 and does the divide with a gcc library

12: 00000000     0 NOTYPE  GLOBAL DEFAULT  UND __udivmodhi4

That is going to save you on overall code space and should save on execution when you would have normally had to deal with generic floating point libraries.

Man, I didnt realize you were using signed numbers.

else if ( A > 524 ) A= (A * 7 / 5) - 760;
else if ( A > 136 ) A= (A * 7 / 5) - 685;

both result in negative numbers, I assume coming in it is positive and going out signed?

Okay as I expected this actually gets us somewhere

unsigned int fun ( unsigned int A )
{
    unsigned char  B;
    B = (A>>8)&0xFF;
    if(B==0) A = 7;
    if(B==1) A = 18;
    if(B==2) A = 307;
    if(B==3) A = 211;
    return(A);
}

   0:   99 23           and r25, r25
   2:   01 f0           breq    .+0         ; 0x4 <fun+0x4>
   4:   91 30           cpi r25, 0x01   ; 1
   6:   01 f4           brne    .+0         ; 0x8 <fun+0x8>
   8:   82 e1           ldi r24, 0x12   ; 18
   a:   90 e0           ldi r25, 0x00   ; 0
   c:   08 95           ret

we know that it is using a register for the upper half and a register for the lower half so the shift of 8 with or without the and for completeness simply tells the compiler just use the upper 8 bit register.

Now I am assuming an unsigned incoming and signed outgoing since you dont really have anything in the negative range coming in except for the everything less than or equal to 136.

if you make a table of all the answers. (it comes out to 679 unique answers, if you could only limit it to that).

Ugh, nothing is nice about these numbers.

If your input is unsigned which I suspect it is not, you could look at the upper byte using the B=A>>8 thing and if it is greater than 4 then return 1000. And that only really saves you one 16 bit compare becoming an 8 bit compare with this instruction set and the nature of your numbers.

The next optimization is get rid of the floating point and replace it with (A*7)/5. That is the number one thing you can do with this code to make it faster.

If you have a range of values you expect more than others, put those ranges first and it has to hop through less code for the more common values. If it is an even distribution then either of jonks if then else trees makes it easier to read and might save an instruction here and there. The compiler may still move the tree around to suit its needs or an optimization it may think it sees.

The killers are the two ranges that have the math, if you are willing to burn 960 bytes of program space you can pre-build look up tables for these

extern unsigned int LUTONE[];
extern unsigned int LUTTWO[]
unsigned int fun ( unsigned int A )
{
if ( A <= 136 ) A= 100;
else if ( A < 476 ) A= LUTONE[A-477];
else if ( A <= 524 ) A= 500;
else if ( A <= 860 ) A= LUTTWO[A-860];
else A= 1000;
    return(A);
}

The lookups look something like this

  30:   88 0f           add r24, r24
  32:   99 1f           adc r25, r25
  34:   e8 2f           mov r30, r24
  36:   f9 2f           mov r31, r25
  38:   e0 50           subi    r30, 0x00   ; 0
  3a:   f0 40           sbci    r31, 0x00   ; 0
  3c:   80 81           ld  r24, Z
  3e:   91 81           ldd r25, Z+1    ; 0x01
  40:   08 95           ret
  42:   88 0f           add r24, r24
  44:   99 1f           adc r25, r25
  46:   e8 2f           mov r30, r24
  48:   f9 2f           mov r31, r25
  4a:   e0 50           subi    r30, 0x00   ; 0
  4c:   f0 40           sbci    r31, 0x00   ; 0
  4e:   80 81           ld  r24, Z
  50:   91 81           ldd r25, Z+1    ; 0x01

which is not much more math than the A*7 the loads on a platform like this are dependent on the speed of the flash relative to the clock it is possibly a single cycle per if not then probably two. But the instruction fetches plus the code to do the divide by 5 with a generic division library for 16 bit numbers should be way way way more costly speed wise. Resource consumption-wise the float libraries are probably smaller than the 900 some bytes, maybe, you can try it and find out. If you are doing float elsewhere then you are burning that flash anyway but that doesnt excuse using float on an AVR esp in code you want to go fast, so still do the integer math if you dont use a look up table.

Short answer get rid of the A * 1.4 and replace with (A * 7) / 5. If you know the data is going to be heavy in certain ranges more than others put those ranges first for performance, otherwise make it more readable with one of jonk's if-then-else trees: The smallest range to largest or largest to smallest. You might save an instruction or two on the deeper cases.

Related Solutions

Electronic – Reusing something unrelated as a hardware adder for code optimization

As a first step I would look at the generated assembly so see what takes the most time. Next determine how much time you have available, and make a rough estimate whether the most optimized solution (assembly, both loops unrolled) can be fast enough. If not,look voor a very different solution (faster chip, different algorithm, make the code that changes the arrays do precompute work like davidcarry suggested, etc).

When unrolling you can at least include an assertion that the #define's are still the value you assumed when you unrolled. But using reasonable macro language unrolling can probably be done by a macro that takes the #defines's into account. BUT after each chance of the #define's you would have to re-test whether you fulfill your timeconstraints, so maybe you don't want automatic adaptation at all.

I don't understand why you are looking for a "can be used as a hardware adder." Your CPU can add two 32 bit values as fast as you can supply them! Which hints at a possible optimization: pack two 16-bit values in a 32-bit word. If you can guarantee that the 16-bit sum never overflows this might speed you up by a factor of 2. Maybe more, this is a 32-bit processor that is probably more comfortable with accessing 32-bit data instead of 16-bit data. Even without the packing trick it might be a good idea to try using 32-bit integers to speed things up.

As for the unrolling: unrolling the inner loop will be more effective than unrolling the outer loop. You might swap the loops, then unroll for the element size, and keep the for() for the chunk size. This way j*4 can be computed once.

You use the x[ i ] style of acessing array elements. I don't know how clever your compiler is, if it takes your index expressions literally it will generate much more efficient code if you use the *p++ style. But again: first check what the compiler does, otherwise you might be doing work for nothing.

Assembler versus C: Never underestimate a compiler. Before you try assembly, first check what the compiler makes, and think why. Better give the compiler every opportunity to optimize than do it yourself in assembly.

Electronic – arduino – Help with delayed smoothing code

A simple smoothed estimator can be obtained simply by a linear blending of the newly sampled value and the existing smoothed value:

smoothed_value := (1 - alpha) * smoothed_value + alpha * new_value

For instance if alpha is 0.20 (20%), then we have:

smoothed_value = 0.8 * smoothed_value + 0.2 * new_value

In other words, take 80% of the existing value, and 20% of the new sample.

By doing some algebraic manipulation, we can show that it can also be calculated this way:

smoothed_value := smoothed_value + alpha * (new_value - smoothed_value)

In other words, update the smoothed value by adding to it just the alpha fraction of the step between it and the new value.

If you make everything integer, and use a power of two for alpha or a multiple of a power of two, you can do this very efficiently on even unsophisticated processors that don't support multiplication or division. For instance, suppose we want to make alpha 3/16. In C, we can do the multiply by 3 and divide by 16 with shifts and adds:

delta = new_value - smoothed_value
smoothed_value += ((delta << 1) + delta) >> 4;

This type of smoothing exhibits exponential decay. It is similar to processes in nature such as, oh, the way an RC circuit charges or discharges to track changes in the applied voltage.

The alpha parameter is in the range [0, 1]. If it is zero, the new values are ignored; there is no convergence. If it is 1, then smoothed_value isn't smoothed at all; it follows new_value instantly. Low values close to zero provide slow convergence.

This algorithm, or something very similar, is used in the TCP protocol to establish a basis for the retransmission time-out based on measurements of the connection round-trip time.

Best Answer

Related Solutions

Electronic – Reusing something unrelated as a hardware adder for code optimization

Electronic – arduino – Help with delayed smoothing code

Related Topic