Electronic – Reusing something unrelated as a hardware adder for code optimization


I am running into some speed issues in an ISR. Due to the nature of my code, I need to add 20 16-bit values inside the ISR. The code looks like this:

 #define Element_size 4
 #define Chunk_size 5

 void add(void)
 {
   int i;
   int j;

   /* sum[] and data[] are globals */
   for (i = 0; i < Element_size; i++)
   {
     sum[i] = 0;
     for (j = 0; j < Chunk_size; j++)
     {
       sum[i] += data[j * Element_size + i];
     }
   }
 }

This is not the exact code, but the algorithm is identical. (data is laid out as chunks of size 4; in this case I have 5 chunks of data. I add the first element of every chunk, then the second element, and so on.) Assume sum and data are globals. This takes longer than the time I have, so I am looking for a way to speed things up. So far, I can think of:

  • Improve the algorithm: I can rewrite this as a single loop (or with no loop at all), but if the chunk sizes change (currently a compile-time option), it will break. I will do this as a last resort once the code is final and no other changes are expected (a rough sketch of the fully unrolled variant follows this list).

  • Move to assembly: I will do this, but it will take time; still a viable option.

  • Make the function inline: This will save some time, but not a whole lot, since the majority of the time is spent in the summation itself. I will do this anyway.

  • Use some hardware block for the summation: I am using the STM32F2 family (Arm Cortex-M3). It has several good peripherals in there, but no stand-alone arithmetic block. I looked at the CRC, hash and crypto peripherals but couldn't find anything that jumps out as usable as a hardware adder.
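
For illustration, a fully unrolled variant (hard-coded for the current sizes, so it breaks as soon as they change; the name add_unrolled is just for illustration) would look roughly like this:

 /* Hard-coded for Element_size == 4 and Chunk_size == 5;
    sum[] and data[] are the same globals as above. */
 void add_unrolled(void)
 {
   sum[0] = data[0] + data[4] + data[8]  + data[12] + data[16];
   sum[1] = data[1] + data[5] + data[9]  + data[13] + data[17];
   sum[2] = data[2] + data[6] + data[10] + data[14] + data[18];
   sum[3] = data[3] + data[7] + data[11] + data[15] + data[19];
 }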

In particular, if one of the masters among you can suggest a hardware-assisted way to do this calculation, I would be indebted. I am looking for an out-of-the-box way of doing this, not just code optimization. This is really the key question I have.

Best Answer

As a first step I would look at the generated assembly to see what takes the most time. Next, determine how much time you have available, and make a rough estimate of whether the most optimized solution (assembly, both loops unrolled) can be fast enough. If not, look for a very different solution (a faster chip, a different algorithm, have the code that changes the arrays do the precompute work as davidcarry suggested, etc.).

When unrolling, you can at least include an assertion that the #defines still have the values you assumed when you unrolled. With a reasonable amount of macro trickery, the unrolling could probably even be generated automatically from the #defines. But after each change to the #defines you would have to re-test whether you still meet your time constraints, so maybe you don't want automatic adaptation at all.
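
A minimal sketch of such a compile-time guard, assuming the unrolled code was written for the sizes in your question:

 #define Element_size 4
 #define Chunk_size 5

 /* Fail the build if someone changes the sizes the unrolled code relies on. */
 #if (Element_size != 4) || (Chunk_size != 5)
 #error "add() was hand-unrolled for Element_size == 4 and Chunk_size == 5"
 #endif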

I don't understand why you are looking for something that "can be used as a hardware adder." Your CPU can add two 32-bit values as fast as you can supply them! Which hints at a possible optimization: pack two 16-bit values into a 32-bit word. If you can guarantee that the 16-bit sums never overflow, this might speed you up by a factor of 2. Maybe more: this is a 32-bit processor that is probably more comfortable accessing 32-bit data than 16-bit data. Even without the packing trick, it might be a good idea to try using 32-bit integers to speed things up.
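
To make the packing idea concrete, here is a minimal sketch (not your code; the name add_packed and the uint16_t declarations are my assumptions). It relies on the target being little-endian, on data[] being 4-byte aligned, and on each per-element sum fitting in 16 bits so no carry spills from the low lane into the high lane:

 #include <stdint.h>

 #define Element_size 4
 #define Chunk_size 5

 extern uint16_t data[Chunk_size * Element_size];
 extern uint16_t sum[Element_size];

 void add_packed(void)
 {
   /* Reading two 16-bit elements per 32-bit load; the cast sidesteps
      strict aliasing and assumes 4-byte alignment of data[]. */
   const uint32_t *p = (const uint32_t *)data;
   uint32_t acc01 = 0;   /* lanes: element 0 (low half), element 1 (high half) */
   uint32_t acc23 = 0;   /* lanes: element 2 (low half), element 3 (high half) */
   int j;

   for (j = 0; j < Chunk_size; j++)
   {
     acc01 += p[2 * j];       /* one 32-bit add accumulates two 16-bit lanes */
     acc23 += p[2 * j + 1];
   }

   sum[0] = (uint16_t)acc01;
   sum[1] = (uint16_t)(acc01 >> 16);
   sum[2] = (uint16_t)acc23;
   sum[3] = (uint16_t)(acc23 >> 16);
 }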

As for the unrolling: unrolling the inner loop will be more effective than unrolling the outer loop. You might swap the loops, then unroll for the element size, and keep the for() for the chunk size. This way j*4 is computed once per chunk instead of once per element.
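
A rough sketch of that loop swap, reusing the declarations from the packing sketch above (the name add_swapped is just for illustration, and it assumes Element_size stays at 4):

 /* Uses the same data[], sum[], and #defines as the sketch above. */
 void add_swapped(void)
 {
   int j;

   sum[0] = sum[1] = sum[2] = sum[3] = 0;
   for (j = 0; j < Chunk_size; j++)
   {
     int base = j * Element_size;   /* the old j*4, computed once per chunk */
     sum[0] += data[base + 0];
     sum[1] += data[base + 1];
     sum[2] += data[base + 2];
     sum[3] += data[base + 3];
   }
 }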

You use the x[i] style of accessing array elements. I don't know how clever your compiler is; if it takes your index expressions literally, it will generate much more efficient code if you use the *p++ style. But again: first check what the compiler does, otherwise you might be doing work for nothing.
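
The same summation in the *p++ style, again only a sketch with the same assumed globals as above:

 /* Uses the same data[], sum[], and #defines as the sketches above;
    walks data[] once with a single incrementing pointer. */
 void add_ptr(void)
 {
   const uint16_t *p = data;
   int j;

   sum[0] = sum[1] = sum[2] = sum[3] = 0;
   for (j = 0; j < Chunk_size; j++)
   {
     sum[0] += *p++;
     sum[1] += *p++;
     sum[2] += *p++;
     sum[3] += *p++;
   }
 }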

Assembler versus C: never underestimate a compiler. Before you try assembly, first check what the compiler generates, and think about why. Better to give the compiler every opportunity to optimize than to do it yourself in assembly.
