The floating point is the killer here. It appears to double the amount of code generated, though a larger instruction count is not by itself necessarily slower. The code has to make a number of library calls, which I assume are float-to-int and int-to-float conversions. Interestingly, changing from double (1.4) to single float (1.4F) didn't change the overall result, although again the code is making a lot of calls (I assume to float conversion or arithmetic routines).
So even the times 7 has to be synthesized.
unsigned int fun ( unsigned int A )
{
A = A*7;
return(A);
}
00000000 <fun>:
0: 28 2f mov r18, r24
2: 39 2f mov r19, r25
4: 22 0f add r18, r18
6: 33 1f adc r19, r19
8: 22 0f add r18, r18
a: 33 1f adc r19, r19
c: 22 0f add r18, r18
e: 33 1f adc r19, r19
10: 42 2f mov r20, r18
12: 53 2f mov r21, r19
14: 48 1b sub r20, r24
16: 59 0b sbc r21, r25
18: 84 2f mov r24, r20
1a: 95 2f mov r25, r21
1c: 08 95 ret
So one might try this as an alternative:
unsigned int fun ( unsigned int A )
{
A = (A<<2)+(A<<1)+A;
return(A);
}
00000000 <fun>:
0: 28 2f mov r18, r24
2: 39 2f mov r19, r25
4: 22 0f add r18, r18
6: 33 1f adc r19, r19
8: 22 0f add r18, r18
a: 33 1f adc r19, r19
c: 48 2f mov r20, r24
e: 59 2f mov r21, r25
10: 44 0f add r20, r20
12: 55 1f adc r21, r21
14: 24 0f add r18, r20
16: 35 1f adc r19, r21
18: 82 0f add r24, r18
1a: 93 1f adc r25, r19
1c: 08 95 ret
Same number of instructions. To see this more easily, drop to 8 bits:
unsigned char fun ( unsigned char A )
{
A = A * 7;
return(A);
}
00000000 <fun>:
0: 98 2f mov r25, r24
2: 99 0f add r25, r25
4: 99 0f add r25, r25
6: 99 0f add r25, r25
8: 98 1b sub r25, r24
a: 89 2f mov r24, r25
c: 08 95 ret
and using shifting and adding
00000000 <fun>:
0: 98 2f mov r25, r24
2: 99 0f add r25, r25
4: 99 0f add r25, r25
6: 28 2f mov r18, r24
8: 22 0f add r18, r18
a: 92 0f add r25, r18
c: 89 0f add r24, r25
e: 08 95 ret
Duh, right: A*7 = (A<<3) - A.
00000000 <fun>:
0: 98 2f mov r25, r24 r25 = A
2: 99 0f add r25, r25 r25 = A + A = 2A
4: 99 0f add r25, r25 r25 = 2A + 2A = 4A
6: 99 0f add r25, r25 r25 = 4A + 4A = 8A
8: 98 1b sub r25, r24 r25 = 8A - A = 7A
a: 89 2f mov r24, r25 return 7A
c: 08 95 ret
The (A<<2)+(A<<1)+A source gives this
00000000 <fun>:
0: 98 2f mov r25, r24 r25 = A
2: 99 0f add r25, r25 r25 = A + A = 2A
4: 99 0f add r25, r25 r25 = 2A + 2A = 4A
6: 28 2f mov r18, r24 r18 = A
8: 22 0f add r18, r18 r18 = A + A = 2A
a: 92 0f add r25, r18 r25 = 4A + 2A = 6A
c: 89 0f add r24, r25 return (6A + A)
e: 08 95 ret
But they missed a possible optimization
mov r25, r24 r25 = A
add r25, r25 r25 = A + A = 2A
mov r18, r25 r18 = 2A
add r25, r25 r25 = 2A + 2A = 4A
add r25, r18 r25 = 4A + 2A = 6A
add r24, r25 return (6A + A)
ret
Same number of instructions at least; ideally it executes in the same time.
Looking at these trip points
if ( A <= 136 ) A= 100;
else if ( A < 476 ) A= (A * 1.4) - 685;
else if ( A <= 524 ) A= 500;
else if ( A <= 860 ) A= (A * 1.4) - 760;
else A= 1000;
A is obviously not an 8-bit number; it is at least 16 bits. If it is a double or a float, I give up, game over. Assuming 16 bits...
Here are the constants in hex.
136 0x88
476 0x1DC
524 0x20C
860 0x35C
Looking at this
unsigned int fun ( unsigned int A )
{
if ( A <= 136 ) A= 1;
else if ( A < 476 ) A=2;
else if ( A <= 524 ) A=3;
else A= 4;
return(A);
}
00000000 <fun>:
0: 89 38 cpi r24, 0x89 ; 137
2: 91 05 cpc r25, r1
4: 00 f0 brcs .+0 ; 0x6 <fun+0x6>
6: 8c 3d cpi r24, 0xDC ; 220
8: 21 e0 ldi r18, 0x01 ; 1
a: 92 07 cpc r25, r18
c: 00 f4 brcc .+0 ; 0xe <fun+0xe>
e: 82 e0 ldi r24, 0x02 ; 2
10: 90 e0 ldi r25, 0x00 ; 0
12: 08 95 ret
14: 8d 30 cpi r24, 0x0D ; 13
16: 92 40 sbci r25, 0x02 ; 2
18: 00 f4 brcc .+0 ; 0x1a <fun+0x1a>
1a: 83 e0 ldi r24, 0x03 ; 3
1c: 90 e0 ldi r25, 0x00 ; 0
1e: 08 95 ret
20: 81 e0 ldi r24, 0x01 ; 1
22: 90 e0 ldi r25, 0x00 ; 0
24: 08 95 ret
26: 84 e0 ldi r24, 0x04 ; 4
28: 90 e0 ldi r25, 0x00 ; 0
2a: 08 95 ret
Does this help?
unsigned int fun ( unsigned int A )
{
unsigned char B = A>>2;
if ( B <= (136>>2) ) A= 1;
else if ( B < (476>>2) ) A=2;
else if ( B <= (524>>2) ) A=3;
else A= 4;
return(A);
}
00000000 <fun>:
0: 96 95 lsr r25
2: 87 95 ror r24
4: 96 95 lsr r25
6: 87 95 ror r24
8: 83 32 cpi r24, 0x23 ; 35
a: 00 f0 brcs .+0 ; 0xc <fun+0xc>
c: 87 37 cpi r24, 0x77 ; 119
e: 00 f4 brcc .+0 ; 0x10 <fun+0x10>
10: 82 e0 ldi r24, 0x02 ; 2
12: 90 e0 ldi r25, 0x00 ; 0
14: 08 95 ret
16: 84 38 cpi r24, 0x84 ; 132
18: 00 f4 brcc .+0 ; 0x1a <fun+0x1a>
1a: 83 e0 ldi r24, 0x03 ; 3
1c: 90 e0 ldi r25, 0x00 ; 0
1e: 08 95 ret
20: 81 e0 ldi r24, 0x01 ; 1
22: 90 e0 ldi r25, 0x00 ; 0
24: 08 95 ret
26: 84 e0 ldi r24, 0x04 ; 4
28: 90 e0 ldi r25, 0x00 ; 0
2a: 08 95 ret
Ehh, man, this instruction set is painful.
So looking at this:
unsigned int fun ( unsigned int A )
{
if ( A <= 136 ) A= 100;
else if ( A < 476 ) A= (A * 7);
else if ( A <= 524 ) A= 500;
else if ( A <= 860 ) A= (A * 7);
else A= 1000;
return(A);
}
the compiler is smart enough to lump the A*7's together.
Like this though
unsigned int fun ( unsigned int A )
{
if ( A <= 136 ) A= 100;
else if ( A < 476 ) A= (A * 7) - 685;
else if ( A <= 524 ) A= 500;
else if ( A <= 860 ) A= (A * 7) - 760;
else A= 1000;
return(A);
}
It does not combine the A*7s. Note the 16-bit A*7 looks like this:
30: 28 2f mov r18, r24
32: 39 2f mov r19, r25
34: 22 0f add r18, r18
36: 33 1f adc r19, r19
38: 22 0f add r18, r18
3a: 33 1f adc r19, r19
3c: 22 0f add r18, r18
3e: 33 1f adc r19, r19
40: 42 2f mov r20, r18
42: 53 2f mov r21, r19
44: 48 1b sub r20, r24
46: 59 0b sbc r21, r25
48: 84 2f mov r24, r20
4a: 95 2f mov r25, r21
4c: 08 95 ret
For ease of reading and coding, here are the comparisons in order:
if ( A > 860 ) A= 1000;
else if ( A > 524 ) A= (A * 1.4) - 760;
else if ( A >= 476 ) A= 500;
else if ( A > 136 ) A= (A * 1.4) - 685;
else A= 100;
if ( A <= 136 ) A= 100;
else if ( A < 476 ) A= (A * 1.4) - 685;
else if ( A <= 524 ) A= 500;
else if ( A <= 860 ) A= (A * 1.4) - 760;
else A= 1000;
Might shave some instructions on the worst-case path. But if your data happens to fall in a bell curve of some sort, say most of the samples land between 500 and 600, then you want those ranges compared first and let the others fall through. It may or may not be worth arranging the two ranges that end with A*1.4 as the last two, so that before those compares you do a single B = A * 1.4; you then do the compare for one of the ranges and either subtract A = B - 685 or A = B - 760. That saves some code space, but not necessarily performance: with an if-then-else tree like this, re-ordering is not going to save you much. Some paths get shorter and some longer, so you may add or save a few instructions here and there, but it is still a pretty long path for this code on this architecture.
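The factoring idea can be sketched like this. This is my own sketch, not the original code: it uses the integer stand-in (A*7)/5 for A*1.4 discussed below, assumes unsigned in and signed out, and the shared multiply/divide is done once with each math branch subtracting only its own offset. Note both math branches can go negative, hence the signed return.

```c
/* Hypothetical restructuring: hoist the shared scale, then subtract
   the per-range offset. Thresholds come from the question's code. */
int fun(unsigned int A)
{
    if (A <= 136)             return 100;
    if (A >  860)             return 1000;
    if (A >= 476 && A <= 524) return 500;

    int B = (int)(A * 7u / 5u);        /* shared A*1.4 equivalent */
    return (A < 476) ? B - 685 : B - 760;
}
```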
I didn't do a full link against the library, but looking at readelf:
11: 00000000 0 NOTYPE GLOBAL DEFAULT UND __floatunsisf
12: 00000000 0 NOTYPE GLOBAL DEFAULT UND __mulsf3
13: 00000000 0 NOTYPE GLOBAL DEFAULT UND __subsf3
14: 00000000 0 NOTYPE GLOBAL DEFAULT UND __fixunssfsi
Those floating-point libraries are going to kill you. And that answers something: even though the C standard says that A * 1.4 is a double operation, the compiler is clearly doing single-precision float (presumably because avr-gcc's double is only 32 bits), ignoring the standard.
Changing the A * 1.4 to A * 7 / 5 generates the long string of adds for the times 7 and does the divide with a gcc library routine:
12: 00000000 0 NOTYPE GLOBAL DEFAULT UND __udivmodhi4
That is going to save you on overall code space, and it should save on execution time compared with having to deal with generic floating-point libraries.
Man, I didn't realize you were using signed numbers.
else if ( A > 524 ) A= (A * 7 / 5) - 760;
else if ( A > 136 ) A= (A * 7 / 5) - 685;
both result in negative numbers, I assume coming in it is positive and going out signed?
Okay, as I expected, this actually gets us somewhere:
unsigned int fun ( unsigned int A )
{
unsigned char B;
B = (A>>8)&0xFF;
if(B==0) A = 7;
if(B==1) A = 18;
if(B==2) A = 307;
if(B==3) A = 211;
return(A);
}
0: 99 23 and r25, r25
2: 01 f0 breq .+0 ; 0x4 <fun+0x4>
4: 91 30 cpi r25, 0x01 ; 1
6: 01 f4 brne .+0 ; 0x8 <fun+0x8>
8: 82 e1 ldi r24, 0x12 ; 18
a: 90 e0 ldi r25, 0x00 ; 0
c: 08 95 ret
We know the compiler uses one register for the upper half and one for the lower half, so the shift by 8 (with or without the AND for completeness) simply tells the compiler to use the upper 8-bit register directly.
Now I am assuming an unsigned value coming in and a signed value going out, since you don't really have anything in the negative range coming in except for the everything-less-than-or-equal-to-136 case.
If you make a table of all the answers, it comes out to 679 unique answers; if only you could limit it to that.
Ugh, nothing is nice about these numbers.
If your input is unsigned, which I suspect it is not, you could look at the upper byte using the B = A>>8 trick: if it is greater than 3 (meaning A >= 1024, which is already past 860), return 1000. That only really turns one 16-bit compare into an 8-bit compare, given this instruction set and the nature of your numbers.
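A minimal sketch of that early-out (my naming; the real function would fall through to the full compare tree):

```c
/* If the upper byte is greater than 3 then A >= 1024, which is
   already past the 860 trip point, so a single 8-bit compare
   settles it. */
unsigned int clamp_top(unsigned int A)
{
    unsigned char B = (unsigned char)(A >> 8);
    if (B > 3)
        return 1000;    /* A >= 1024 implies A > 860 */
    return A;           /* placeholder: fall through to the tree */
}
```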
The next optimization is to get rid of the floating point and replace it with (A*7)/5. That is the number-one thing you can do with this code to make it faster.
If you have ranges of values you expect more than others, put those ranges first, so the more common values hop through less code. If it is an even distribution, then either of jonk's if-then-else trees makes it easier to read and might save an instruction here and there. The compiler may still move the tree around to suit its needs or an optimization it thinks it sees.
The killers are the two ranges that have the math. If you are willing to burn roughly 1350 bytes of program space (two tables of 339 and 336 16-bit entries), you can pre-build lookup tables for these:
extern unsigned int LUTONE[];
extern unsigned int LUTTWO[];
unsigned int fun ( unsigned int A )
{
if ( A <= 136 ) A= 100;
else if ( A < 476 ) A= LUTONE[A-137];
else if ( A <= 524 ) A= 500;
else if ( A <= 860 ) A= LUTTWO[A-525];
else A= 1000;
return(A);
}
The lookups look something like this
30: 88 0f add r24, r24
32: 99 1f adc r25, r25
34: e8 2f mov r30, r24
36: f9 2f mov r31, r25
38: e0 50 subi r30, 0x00 ; 0
3a: f0 40 sbci r31, 0x00 ; 0
3c: 80 81 ld r24, Z
3e: 91 81 ldd r25, Z+1 ; 0x01
40: 08 95 ret
42: 88 0f add r24, r24
44: 99 1f adc r25, r25
46: e8 2f mov r30, r24
48: f9 2f mov r31, r25
4a: e0 50 subi r30, 0x00 ; 0
4c: f0 40 sbci r31, 0x00 ; 0
4e: 80 81 ld r24, Z
50: 91 81 ldd r25, Z+1 ; 0x01
That is not much more math than the A*7. The loads on a platform like this depend on the speed of the flash relative to the clock: possibly a single cycle each, if not then probably two. But the instruction fetches plus the code to do the divide by 5 with a generic 16-bit division library should be way, way more costly speed-wise. Resource-wise, the float libraries are probably smaller than the kilobyte-plus of tables, maybe; you can try it and find out. If you are doing float elsewhere then you are burning that flash anyway, but that doesn't excuse using float on an AVR, especially in code you want to go fast, so still do the integer math if you don't use a lookup table.
Short answer: get rid of the A * 1.4 and replace it with (A * 7) / 5. If you know the data is going to be heavy in certain ranges, put those ranges first for performance; otherwise make it more readable with one of jonk's if-then-else trees, smallest range to largest or largest to smallest. You might save an instruction or two on the deeper cases.
Best Answer
As a first step I would look at the generated assembly to see what takes the most time. Next determine how much time you have available, and make a rough estimate of whether the most optimized solution (assembly, both loops unrolled) can be fast enough. If not, look for a very different solution (faster chip, different algorithm, make the code that changes the arrays do precompute work like davidcarry suggested, etc).
When unrolling you can at least include an assertion that the #defines still have the values you assumed when you unrolled. With a reasonable macro language, the unrolling could probably even be done by a macro that takes the #defines into account. BUT after each change of the #defines you would have to re-test whether you meet your time constraints, so maybe you don't want automatic adaptation at all.
I don't understand why you are looking for something that "can be used as a hardware adder." Your CPU can add two 32-bit values as fast as you can supply them! Which hints at a possible optimization: pack two 16-bit values into a 32-bit word. If you can guarantee that the 16-bit sums never overflow, this might speed you up by a factor of 2. Maybe more: this is a 32-bit processor that is probably more comfortable accessing 32-bit data than 16-bit data. Even without the packing trick, it might be a good idea to try 32-bit integers to speed things up.
As for the unrolling: unrolling the inner loop will be more effective than unrolling the outer loop. You might swap the loops, then unroll for the element size, and keep the for() for the chunk size. This way j*4 can be computed once.
You use the x[ i ] style of accessing array elements. I don't know how clever your compiler is; if it takes your index expressions literally, it will generate much more efficient code if you use the *p++ style. But again: first check what the compiler does, otherwise you might be doing work for nothing.
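The two styles side by side (my sketch, not the question's actual loop; whether they differ in practice depends entirely on the compiler):

```c
#include <stddef.h>
#include <stdint.h>

/* Indexed form: a literal-minded compiler may recompute the address
   x + i on every iteration. */
uint32_t sum_indexed(const uint16_t *x, size_t n)
{
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Pointer form: only a pointer increment per element. */
uint32_t sum_pointer(const uint16_t *x, size_t n)
{
    uint32_t s = 0;
    const uint16_t *end = x + n;
    while (x < end)
        s += *x++;
    return s;
}
```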
Assembler versus C: never underestimate a compiler. Before you try assembly, first check what the compiler generates, and think about why it did what it did. Better to give the compiler every opportunity to optimize than to do it yourself in assembly.