The floating point is the killer here. It appears to double the amount of code generated, though a larger instruction count is not by itself necessarily slower. The code has to make a number of library calls, which I assume are float-to-int and int-to-float conversions. Interestingly, changing from double (1.4) to single float (1.4F) didn't change the overall result, although again the code is making a lot of calls (I assume to float conversion or arithmetic routines).
So even the times 7 has to be synthesized.
unsigned int fun ( unsigned int A )
{
A = A*7;
return(A);
}
00000000 <fun>:
0: 28 2f mov r18, r24
2: 39 2f mov r19, r25
4: 22 0f add r18, r18
6: 33 1f adc r19, r19
8: 22 0f add r18, r18
a: 33 1f adc r19, r19
c: 22 0f add r18, r18
e: 33 1f adc r19, r19
10: 42 2f mov r20, r18
12: 53 2f mov r21, r19
14: 48 1b sub r20, r24
16: 59 0b sbc r21, r25
18: 84 2f mov r24, r20
1a: 95 2f mov r25, r21
1c: 08 95 ret
So one might try this as an alternative:
unsigned int fun ( unsigned int A )
{
A = (A<<2)+(A<<1)+A;
return(A);
}
00000000 <fun>:
0: 28 2f mov r18, r24
2: 39 2f mov r19, r25
4: 22 0f add r18, r18
6: 33 1f adc r19, r19
8: 22 0f add r18, r18
a: 33 1f adc r19, r19
c: 48 2f mov r20, r24
e: 59 2f mov r21, r25
10: 44 0f add r20, r20
12: 55 1f adc r21, r21
14: 24 0f add r18, r20
16: 35 1f adc r19, r21
18: 82 0f add r24, r18
1a: 93 1f adc r25, r19
1c: 08 95 ret
Same number of instructions. To see this more easily, drop to 8 bits:
unsigned char fun ( unsigned char A )
{
A = A * 7;
return(A);
}
00000000 <fun>:
0: 98 2f mov r25, r24
2: 99 0f add r25, r25
4: 99 0f add r25, r25
6: 99 0f add r25, r25
8: 98 1b sub r25, r24
a: 89 2f mov r24, r25
c: 08 95 ret
and using shifting and adding
00000000 <fun>:
0: 98 2f mov r25, r24
2: 99 0f add r25, r25
4: 99 0f add r25, r25
6: 28 2f mov r18, r24
8: 22 0f add r18, r18
a: 92 0f add r25, r18
c: 89 0f add r24, r25
e: 08 95 ret
Duh, right: A*7 = (A<<3) - A.
00000000 <fun>:
0: 98 2f mov r25, r24 r25 = A
2: 99 0f add r25, r25 r25 = A + A = 2A
4: 99 0f add r25, r25 r25 = 2A + 2A = 4A
6: 99 0f add r25, r25 r25 = 4A + 4A = 8A
8: 98 1b sub r25, r24 r25 = 8A - A = 7A
a: 89 2f mov r24, r25 return 7A
c: 08 95 ret
The (A<<2)+(A<<1)+A source gives this
00000000 <fun>:
0: 98 2f mov r25, r24 r25 = A
2: 99 0f add r25, r25 r25 = A + A = 2A
4: 99 0f add r25, r25 r25 = 2A + 2A = 4A
6: 28 2f mov r18, r24 r18 = A
8: 22 0f add r18, r18 r18 = A + A = 2A
a: 92 0f add r25, r18 r25 = 4A + 2A = 6A
c: 89 0f add r24, r25 return (6A + A)
e: 08 95 ret
But they missed a possible optimization
mov r25, r24 r25 = A
add r25, r25 r25 = A + A = 2A
mov r18, r25 r18 = 2A
add r25, r25 r25 = 2A + 2A = 4A
add r25, r18 r25 = 4A + 2A = 6A
add r24, r25 return (6A + A)
ret
Same number of instructions at least; ideally it executes in the same time.
Looking at these trip points
if ( A <= 136 ) A= 100;
else if ( A < 476 ) A= (A * 1.4) - 685;
else if ( A <= 524 ) A= 500;
else if ( A <= 860 ) A= (A * 1.4) - 760;
else A= 1000;
A is obviously not an 8-bit number; it is at least 16 bits. If it is a double or a float, I give up, game over. Assuming 16 bits...
Here are the constants in hex.
136 0x88
476 0x1DC
524 0x20C
860 0x35C
Looking at this
unsigned int fun ( unsigned int A )
{
if ( A <= 136 ) A= 1;
else if ( A < 476 ) A=2;
else if ( A <= 524 ) A=3;
else A= 4;
return(A);
}
00000000 <fun>:
0: 89 38 cpi r24, 0x89 ; 137
2: 91 05 cpc r25, r1
4: 00 f0 brcs .+0 ; 0x6 <fun+0x6>
6: 8c 3d cpi r24, 0xDC ; 220
8: 21 e0 ldi r18, 0x01 ; 1
a: 92 07 cpc r25, r18
c: 00 f4 brcc .+0 ; 0xe <fun+0xe>
e: 82 e0 ldi r24, 0x02 ; 2
10: 90 e0 ldi r25, 0x00 ; 0
12: 08 95 ret
14: 8d 30 cpi r24, 0x0D ; 13
16: 92 40 sbci r25, 0x02 ; 2
18: 00 f4 brcc .+0 ; 0x1a <fun+0x1a>
1a: 83 e0 ldi r24, 0x03 ; 3
1c: 90 e0 ldi r25, 0x00 ; 0
1e: 08 95 ret
20: 81 e0 ldi r24, 0x01 ; 1
22: 90 e0 ldi r25, 0x00 ; 0
24: 08 95 ret
26: 84 e0 ldi r24, 0x04 ; 4
28: 90 e0 ldi r25, 0x00 ; 0
2a: 08 95 ret
Does this help?
unsigned int fun ( unsigned int A )
{
unsigned char B = A>>2;
if ( B <= (136>>2) ) A= 1;
else if ( B < (476>>2) ) A=2;
else if ( B <= (524>>2) ) A=3;
else A= 4;
return(A);
}
00000000 <fun>:
0: 96 95 lsr r25
2: 87 95 ror r24
4: 96 95 lsr r25
6: 87 95 ror r24
8: 83 32 cpi r24, 0x23 ; 35
a: 00 f0 brcs .+0 ; 0xc <fun+0xc>
c: 87 37 cpi r24, 0x77 ; 119
e: 00 f4 brcc .+0 ; 0x10 <fun+0x10>
10: 82 e0 ldi r24, 0x02 ; 2
12: 90 e0 ldi r25, 0x00 ; 0
14: 08 95 ret
16: 84 38 cpi r24, 0x84 ; 132
18: 00 f4 brcc .+0 ; 0x1a <fun+0x1a>
1a: 83 e0 ldi r24, 0x03 ; 3
1c: 90 e0 ldi r25, 0x00 ; 0
1e: 08 95 ret
20: 81 e0 ldi r24, 0x01 ; 1
22: 90 e0 ldi r25, 0x00 ; 0
24: 08 95 ret
26: 84 e0 ldi r24, 0x04 ; 4
28: 90 e0 ldi r25, 0x00 ; 0
2a: 08 95 ret
Ehh, man, this instruction set is painful.
So looking at this:
unsigned int fun ( unsigned int A )
{
if ( A <= 136 ) A= 100;
else if ( A < 476 ) A= (A * 7);
else if ( A <= 524 ) A= 500;
else if ( A <= 860 ) A= (A * 7);
else A= 1000;
return(A);
}
the compiler is smart enough to lump the A*7's together.
Like this though
unsigned int fun ( unsigned int A )
{
if ( A <= 136 ) A= 100;
else if ( A < 476 ) A= (A * 7) - 685;
else if ( A <= 524 ) A= 500;
else if ( A <= 860 ) A= (A * 7) - 760;
else A= 1000;
return(A);
}
It does not combine the A*7s. Note the 16-bit A*7 looks like this:
30: 28 2f mov r18, r24
32: 39 2f mov r19, r25
34: 22 0f add r18, r18
36: 33 1f adc r19, r19
38: 22 0f add r18, r18
3a: 33 1f adc r19, r19
3c: 22 0f add r18, r18
3e: 33 1f adc r19, r19
40: 42 2f mov r20, r18
42: 53 2f mov r21, r19
44: 48 1b sub r20, r24
46: 59 0b sbc r21, r25
48: 84 2f mov r24, r20
4a: 95 2f mov r25, r21
4c: 08 95 ret
For ease of reading and coding, here are the comparisons in order:
if ( A > 860 ) A= 1000;
else if ( A > 524 ) A= (A * 1.4) - 760;
else if ( A >= 476 ) A= 500;
else if ( A > 136 ) A= (A * 1.4) - 685;
else A= 100;
if ( A <= 136 ) A= 100;
else if ( A < 476 ) A= (A * 1.4) - 685;
else if ( A <= 524 ) A= 500;
else if ( A <= 860 ) A= (A * 1.4) - 760;
else A= 1000;
Might shave some instructions on the worst-case path. But if your data happens to fall in a bell curve of some sort, say most of the samples land between 500 and 600, then you want those ranges compared first and let the others fall through. It may or may not be worth arranging the two ranges that end with A*1.4 as the last two, so that before those compares you do a single B = A * 1.4; you then do the compare for one of the ranges and either subtract A = B - 685 or A = B - 760. That saves some code space, but not necessarily performance: with an if-then-else tree like this, re-ordering is not going to save you much. Some paths get shorter and some longer, so you may add or save a few instructions here and there, but it is still a pretty long path for this code on this architecture.
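The factoring idea can be sketched like this. This is my own sketch, not the original code: it uses the integer stand-in (A*7)/5 for A*1.4 discussed below, assumes unsigned in and signed out, and the shared multiply/divide is done once with each math branch subtracting only its own offset. Note both math branches can go negative, hence the signed return.

```c
/* Hypothetical restructuring: hoist the shared scale, then subtract
   the per-range offset. Thresholds come from the question's code. */
int fun(unsigned int A)
{
    if (A <= 136)             return 100;
    if (A >  860)             return 1000;
    if (A >= 476 && A <= 524) return 500;

    int B = (int)(A * 7u / 5u);        /* shared A*1.4 equivalent */
    return (A < 476) ? B - 685 : B - 760;
}
```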
I didn't do a full link against the library, but looking at readelf:
11: 00000000 0 NOTYPE GLOBAL DEFAULT UND __floatunsisf
12: 00000000 0 NOTYPE GLOBAL DEFAULT UND __mulsf3
13: 00000000 0 NOTYPE GLOBAL DEFAULT UND __subsf3
14: 00000000 0 NOTYPE GLOBAL DEFAULT UND __fixunssfsi
Those floating-point libraries are going to kill you. And that answers something: even though the C standard says that A * 1.4 is a double operation, the compiler is clearly doing single-precision float (presumably because avr-gcc's double is only 32 bits), ignoring the standard.
Changing the A * 1.4 to A * 7 / 5 generates the long string of adds for the times 7 and does the divide with a gcc library routine:
12: 00000000 0 NOTYPE GLOBAL DEFAULT UND __udivmodhi4
That is going to save you on overall code space, and it should save on execution time compared with having to deal with generic floating-point libraries.
Man, I didn't realize you were using signed numbers.
else if ( A > 524 ) A= (A * 7 / 5) - 760;
else if ( A > 136 ) A= (A * 7 / 5) - 685;
both result in negative numbers, I assume coming in it is positive and going out signed?
Okay, as I expected, this actually gets us somewhere:
unsigned int fun ( unsigned int A )
{
unsigned char B;
B = (A>>8)&0xFF;
if(B==0) A = 7;
if(B==1) A = 18;
if(B==2) A = 307;
if(B==3) A = 211;
return(A);
}
0: 99 23 and r25, r25
2: 01 f0 breq .+0 ; 0x4 <fun+0x4>
4: 91 30 cpi r25, 0x01 ; 1
6: 01 f4 brne .+0 ; 0x8 <fun+0x8>
8: 82 e1 ldi r24, 0x12 ; 18
a: 90 e0 ldi r25, 0x00 ; 0
c: 08 95 ret
We know the compiler uses one register for the upper half and one for the lower half, so the shift by 8 (with or without the AND for completeness) simply tells the compiler to use the upper 8-bit register directly.
Now I am assuming an unsigned value coming in and a signed value going out, since you don't really have anything in the negative range coming in except for the everything-less-than-or-equal-to-136 case.
If you make a table of all the answers, it comes out to 679 unique answers; if only you could limit it to that.
Ugh, nothing is nice about these numbers.
If your input is unsigned, which I suspect it is not, you could look at the upper byte using the B = A>>8 trick: if it is greater than 3 (meaning A >= 1024, which is already past 860), return 1000. That only really turns one 16-bit compare into an 8-bit compare, given this instruction set and the nature of your numbers.
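A minimal sketch of that early-out (my naming; the real function would fall through to the full compare tree):

```c
/* If the upper byte is greater than 3 then A >= 1024, which is
   already past the 860 trip point, so a single 8-bit compare
   settles it. */
unsigned int clamp_top(unsigned int A)
{
    unsigned char B = (unsigned char)(A >> 8);
    if (B > 3)
        return 1000;    /* A >= 1024 implies A > 860 */
    return A;           /* placeholder: fall through to the tree */
}
```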
The next optimization is to get rid of the floating point and replace it with (A*7)/5. That is the number-one thing you can do with this code to make it faster.
If you have ranges of values you expect more than others, put those ranges first, so the more common values hop through less code. If it is an even distribution, then either of jonk's if-then-else trees makes it easier to read and might save an instruction here and there. The compiler may still move the tree around to suit its needs or an optimization it thinks it sees.
The killers are the two ranges that have the math. If you are willing to burn roughly 1350 bytes of program space (two tables of 339 and 336 16-bit entries), you can pre-build lookup tables for these:
extern unsigned int LUTONE[];
extern unsigned int LUTTWO[];
unsigned int fun ( unsigned int A )
{
if ( A <= 136 ) A= 100;
else if ( A < 476 ) A= LUTONE[A-137];
else if ( A <= 524 ) A= 500;
else if ( A <= 860 ) A= LUTTWO[A-525];
else A= 1000;
return(A);
}
The lookups look something like this
30: 88 0f add r24, r24
32: 99 1f adc r25, r25
34: e8 2f mov r30, r24
36: f9 2f mov r31, r25
38: e0 50 subi r30, 0x00 ; 0
3a: f0 40 sbci r31, 0x00 ; 0
3c: 80 81 ld r24, Z
3e: 91 81 ldd r25, Z+1 ; 0x01
40: 08 95 ret
42: 88 0f add r24, r24
44: 99 1f adc r25, r25
46: e8 2f mov r30, r24
48: f9 2f mov r31, r25
4a: e0 50 subi r30, 0x00 ; 0
4c: f0 40 sbci r31, 0x00 ; 0
4e: 80 81 ld r24, Z
50: 91 81 ldd r25, Z+1 ; 0x01
That is not much more math than the A*7. The loads on a platform like this depend on the speed of the flash relative to the clock: possibly a single cycle each, if not then probably two. But the instruction fetches plus the code to do the divide by 5 with a generic 16-bit division library should be way, way more costly speed-wise. Resource-wise, the float libraries are probably smaller than the kilobyte-plus of tables, maybe; you can try it and find out. If you are doing float elsewhere then you are burning that flash anyway, but that doesn't excuse using float on an AVR, especially in code you want to go fast, so still do the integer math if you don't use a lookup table.
Short answer: get rid of the A * 1.4 and replace it with (A * 7) / 5. If you know the data is going to be heavy in certain ranges, put those ranges first for performance; otherwise make it more readable with one of jonk's if-then-else trees, smallest range to largest or largest to smallest. You might save an instruction or two on the deeper cases.
Best Answer
As a first step I would look at the generated assembly to see what takes the most time. Next determine how much time you have available, and make a rough estimate of whether the most optimized solution (assembly, both loops unrolled) can be fast enough. If not, look for a very different solution (faster chip, different algorithm, make the code that changes the arrays do precompute work like davidcarry suggested, etc).
When unrolling you can at least include an assertion that the #defines still have the values you assumed when you unrolled. With a reasonable macro language, the unrolling could probably even be done by a macro that takes the #defines into account. BUT after each change of the #defines you would have to re-test whether you meet your time constraints, so maybe you don't want automatic adaptation at all.
I don't understand why you are looking for something that "can be used as a hardware adder." Your CPU can add two 32-bit values as fast as you can supply them! Which hints at a possible optimization: pack two 16-bit values into a 32-bit word. If you can guarantee that the 16-bit sums never overflow, this might speed you up by a factor of 2. Maybe more: this is a 32-bit processor that is probably more comfortable accessing 32-bit data than 16-bit data. Even without the packing trick, it might be a good idea to try 32-bit integers to speed things up.
As for the unrolling: unrolling the inner loop will be more effective than unrolling the outer loop. You might swap the loops, then unroll for the element size, and keep the for() for the chunk size. This way j*4 can be computed once.
You use the x[ i ] style of accessing array elements. I don't know how clever your compiler is; if it takes your index expressions literally, it will generate much more efficient code if you use the *p++ style. But again: first check what the compiler does, otherwise you might be doing work for nothing.
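The two styles side by side (my sketch, not the question's actual loop; whether they differ in practice depends entirely on the compiler):

```c
#include <stddef.h>
#include <stdint.h>

/* Indexed form: a literal-minded compiler may recompute the address
   x + i on every iteration. */
uint32_t sum_indexed(const uint16_t *x, size_t n)
{
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Pointer form: only a pointer increment per element. */
uint32_t sum_pointer(const uint16_t *x, size_t n)
{
    uint32_t s = 0;
    const uint16_t *end = x + n;
    while (x < end)
        s += *x++;
    return s;
}
```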
Assembler versus C: never underestimate a compiler. Before you try assembly, first check what the compiler generates, and think about why it did what it did. Better to give the compiler every opportunity to optimize than to do it yourself in assembly.