Electronic – Most efficient way to split data into segments and fill them into bit masks

cortex-m3 microcontroller

Background:

  • I need to write 32-bit data to a receiving instrument. The source of the data is a 32-bit read-only register. The project is time-critical: every clock cycle matters.
  • That requires 32 DIO pins. My MCU has enough pins, but they are controlled by 4 different PIO (Parallel Input/Output) controllers.
  • The MCU is an AT91SAM3X8E. Each PIO controller has 32 channels, but only certain pins are available on each PIO. The available pins are defined by 4 bit masks. Anything written to an unavailable pin (bit mask value = 0) is ignored.
  • So I have to split the 32 bits into 4 parts and write each part to one of the 4 PIOs.

Context:

data = "some 32bit register value"
mask_PIOA = 0x0018C080;  // 5 bits
mask_PIOB = 0x04204000;  // 3 bits
mask_PIOC = 0x000FF3FE;  // 17 bits
mask_PIOD = 0x0000064F;  // 7 bits

Example of what should be written to PIOA:

Data segment                PIO mask                  Desired result
xxxx xxxx xxxx xxxx         0000 0000 0001 1000       xxxx xxxx xxx1 0xxx
xxxx xxxx xxx1 0111         1100 0000 1000 0000       11xx xxxx 1xxx xxxx

x = don't care

Question:

  • What is the most efficient way (costing the fewest clock cycles) to split the 32-bit data into 4 parts and fill the 4 parts into the 4 masks? ("Fill" means placing the data segment into the mask bit by bit, wherever the mask value is 1.)
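To pin down what "fill" means here, a slow bit-by-bit reference in C might look like the sketch below. The name deposit_bits is made up for illustration, and the ordering (lowest data bit goes to the lowest set mask bit) is an assumption. It is meant as a definition to test faster versions against, not as the fast path.

```c
#include <stdint.h>

/* Scatter the low bits of `data` into the positions where `mask` has a 1,
   lowest data bit to lowest set mask bit. Bits outside the mask stay 0. */
static uint32_t deposit_bits(uint32_t data, uint32_t mask)
{
    uint32_t result = 0;
    while (mask != 0) {
        uint32_t lowest = mask & (~mask + 1u);  /* isolate the lowest set mask bit */
        if (data & 1u) {
            result |= lowest;                   /* copy the current data bit there */
        }
        data >>= 1;                             /* move on to the next data bit */
        mask &= mask - 1u;                      /* drop the mask bit just handled */
    }
    return result;
}
```

With the PIOA mask, the 5-bit segment 1 0111 (0x17) comes out as 0x0010C080.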

Comments:

This may not be the answer you want but it's what the question asks…
"Most efficient" in clock cycles would be to create your own 32-bit
PIO in an FPGA with an MCU core. Xilinx Zynq, Microchip Smartfusion
etc offer ARM Cortex CPU + FPGA on the same chip. Then you could write
all 32 bits at once. – Brian Drummond

This is actually a good idea; I'm looking into getting an MCU+FPGA board now. But I still want to solve this with the MCU I have.

Port A – 5 bits using a lookup table with 32 entries

Port B – 3 bits using a lookup table with 8 entries

Port C – no lookup table, just map the bits directly to the port with rotate/mask operations. First copy the 32 bit data, rotate it
left once and mask with 0x000003FE to map bits 0-8 to port bits 1-9,
then rotate another copy left 3 times and mask with 0x000FF000 to map
bits 9-16 to port bits 12-19. Finally combine the two masked values
with logical OR, and send the result to the port through a mask of
0x000FF3FE.

Port D – 7 bits using a lookup table with 128 entries.

This seems most reasonable, thanks !

I generated the lookup tables in a higher-level language, because I'm not proficient with C++:

const uint32_t LutA[32] =   {0x00000000 ,0x00000080 ,0x00004000 ,0x00004080 ,0x00008000 ,0x00008080 ,0x0000C000 ,0x0000C080 
                            ,0x00080000 ,0x00080080 ,0x00084000 ,0x00084080 ,0x00088000 ,0x00088080 ,0x0008C000 ,0x0008C080 
                            ,0x00100000 ,0x00100080 ,0x00104000 ,0x00104080 ,0x00108000 ,0x00108080 ,0x0010C000 ,0x0010C080
                            ,0x00180000 ,0x00180080 ,0x00184000 ,0x00184080 ,0x00188000 ,0x00188080 ,0x0018C000 ,0x0018C080};
                            
const uint32_t LutB[8] =    {0x00000000 ,0x00004000 ,0x00200000 ,0x00204000 ,0x04000000 ,0x04004000 ,0x04200000 ,0x04204000};

const uint32_t LutD[128] =  {0x00000000 ,0x00000001 ,0x00000002 ,0x00000003 ,0x00000004 ,0x00000005 ,0x00000006 ,0x00000007 
                            ,0x00000008 ,0x00000009 ,0x0000000A ,0x0000000B ,0x0000000C ,0x0000000D ,0x0000000E ,0x0000000F 
                            ,0x00000040 ,0x00000041 ,0x00000042 ,0x00000043 ,0x00000044 ,0x00000045 ,0x00000046 ,0x00000047 
                            ,0x00000048 ,0x00000049 ,0x0000004A ,0x0000004B ,0x0000004C ,0x0000004D ,0x0000004E ,0x0000004F 
                            ,0x00000200 ,0x00000201 ,0x00000202 ,0x00000203 ,0x00000204 ,0x00000205 ,0x00000206 ,0x00000207 
                            ,0x00000208 ,0x00000209 ,0x0000020A ,0x0000020B ,0x0000020C ,0x0000020D ,0x0000020E ,0x0000020F 
                            ,0x00000240 ,0x00000241 ,0x00000242 ,0x00000243 ,0x00000244 ,0x00000245 ,0x00000246 ,0x00000247 
                            ,0x00000248 ,0x00000249 ,0x0000024A ,0x0000024B ,0x0000024C ,0x0000024D ,0x0000024E ,0x0000024F 
                            ,0x00000400 ,0x00000401 ,0x00000402 ,0x00000403 ,0x00000404 ,0x00000405 ,0x00000406 ,0x00000407 
                            ,0x00000408 ,0x00000409 ,0x0000040A ,0x0000040B ,0x0000040C ,0x0000040D ,0x0000040E ,0x0000040F 
                            ,0x00000440 ,0x00000441 ,0x00000442 ,0x00000443 ,0x00000444 ,0x00000445 ,0x00000446 ,0x00000447 
                            ,0x00000448 ,0x00000449 ,0x0000044A ,0x0000044B ,0x0000044C ,0x0000044D ,0x0000044E ,0x0000044F 
                            ,0x00000600 ,0x00000601 ,0x00000602 ,0x00000603 ,0x00000604 ,0x00000605 ,0x00000606 ,0x00000607 
                            ,0x00000608 ,0x00000609 ,0x0000060A ,0x0000060B ,0x0000060C ,0x0000060D ,0x0000060E ,0x0000060F 
                            ,0x00000640 ,0x00000641 ,0x00000642 ,0x00000643 ,0x00000644 ,0x00000645 ,0x00000646 ,0x00000647 
                            ,0x00000648 ,0x00000649 ,0x0000064A ,0x0000064B ,0x0000064C ,0x0000064D ,0x0000064E ,0x0000064F};

Then the 4 values to write to the 4 ports are:

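// Data and DataA..DataD are uint32_t. Each index is shifted and masked before the lookup.
DataA = LutA[(Data >> 20) & 0x0000001F];                        // source bits 20-24
DataB = LutB[(Data & 0x00000001) | ((Data >> 9) & 0x00000006)]; // source bits 0, 10, 11 -> index bits 0, 1, 2
DataC = Data;                                                   // source bits 1-9, 12-19 land on the same port bits; the rest is ignored by the PIOC mask
DataD = LutD[(Data >> 25) & 0x0000007F];                        // source bits 25-31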

Haven't tested yet because I don't have access to the hardware right now.
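Since the hardware isn't needed to check the bit mapping itself, a host-side test along these lines could catch indexing mistakes first. This is only a sketch: spread, build_tables, split and mapping_is_one_to_one are hypothetical names, the tables are rebuilt from the four masks rather than copied from the listing, and the port B index is formed as (Data & 1) | ((Data >> 9) & 6) so that source bits 0, 10 and 11 are each masked before being combined.

```c
#include <stdint.h>

#define MASK_A 0x0018C080u
#define MASK_B 0x04204000u
#define MASK_C 0x000FF3FEu
#define MASK_D 0x0000064Fu

static uint32_t lut_a[32], lut_b[8], lut_d[128];

/* Spread the low bits of `index` over the set bits of `mask`, LSB first. */
static uint32_t spread(uint32_t index, uint32_t mask)
{
    uint32_t out = 0;
    for (uint32_t bit = 1; bit != 0; bit <<= 1) {
        if (mask & bit) {
            if (index & 1u) out |= bit;
            index >>= 1;
        }
    }
    return out;
}

static void build_tables(void)
{
    for (uint32_t i = 0; i < 32; i++)  lut_a[i] = spread(i, MASK_A);
    for (uint32_t i = 0; i < 8; i++)   lut_b[i] = spread(i, MASK_B);
    for (uint32_t i = 0; i < 128; i++) lut_d[i] = spread(i, MASK_D);
}

typedef struct { uint32_t a, b, c, d; } PortWords;

/* Split one source word into the four words destined for PIOA..PIOD. */
static PortWords split(uint32_t data)
{
    PortWords w;
    w.a = lut_a[(data >> 20) & 0x1Fu];                  /* source bits 20-24 */
    w.b = lut_b[(data & 0x1u) | ((data >> 9) & 0x6u)];  /* source bits 0, 10, 11 */
    w.c = data & MASK_C;                                /* source bits 1-9, 12-19 */
    w.d = lut_d[(data >> 25) & 0x7Fu];                  /* source bits 25-31 */
    return w;
}

static int popcount32(uint32_t x)
{
    int n = 0;
    while (x) { x &= x - 1u; n++; }
    return n;
}

/* Returns 1 if every source bit drives exactly one pin and an all-ones
   word fills all four masks completely, otherwise 0. */
static int mapping_is_one_to_one(void)
{
    build_tables();
    for (int bit = 0; bit < 32; bit++) {
        PortWords w = split(1u << bit);
        int pins = popcount32(w.a) + popcount32(w.b)
                 + popcount32(w.c) + popcount32(w.d);
        if (pins != 1) return 0;
    }
    PortWords all = split(0xFFFFFFFFu);
    return all.a == MASK_A && all.b == MASK_B
        && all.c == MASK_C && all.d == MASK_D;
}
```

If mapping_is_one_to_one() returns 1, each of the 32 source bits drives exactly one pin and no pin in any mask is left unused.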

Thanks in advance !

Best Answer

You have 3 basic problems:-

  1. How to map the 32 bits of the source word onto different bits in four 32 bit PIO registers.

  2. How to avoid disturbing other bits in the PIO registers.

  3. How to do this as quickly as possible.

For the final analysis you must examine the machine code to see which algorithm uses the fewest cycles, but first let's consider some strategies that might help to speed it up.

For example, if you needed to set 4 'random' bits in a port, use 4 bits of the 32 bit data to index a lookup table that returns a 32 bit number with the appropriate port bits set. The lookup table would have 16 entries in it, one for each possible bit combination. Extracting the mapped bits would require 1 rotate and 1 mask operation to generate the index, and 1 read to get the result. You would then read the PIO register to get the other bits, AND it with the inverse of the bit mask to clear the mapped pins, OR it with your mapped bits, and finally write the result to the port.
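As a rough illustration of that 4-bit case, the sketch below assumes hypothetical pins 3, 7, 12 and 21 fed from source bits 8-11. The pin choice, the ENTRY macro and write_mapped are all made up for the example, and port stands in for whatever output register the real code would use.

```c
#include <stdint.h>

/* Hypothetical wiring: source bits 8..11 drive port pins 3, 7, 12 and 21. */
#define PIN_MASK  ((1u << 3) | (1u << 7) | (1u << 12) | (1u << 21))
#define SRC_SHIFT 8u

/* Index bit 0 -> pin 3, bit 1 -> pin 7, bit 2 -> pin 12, bit 3 -> pin 21. */
#define ENTRY(i) ((((i) & 1u) << 3) | (((i) & 2u) << 6) | \
                  (((i) & 4u) << 10) | (((i) & 8u) << 18))

static const uint32_t lut4[16] = {
    ENTRY(0u),  ENTRY(1u),  ENTRY(2u),  ENTRY(3u),
    ENTRY(4u),  ENTRY(5u),  ENTRY(6u),  ENTRY(7u),
    ENTRY(8u),  ENTRY(9u),  ENTRY(10u), ENTRY(11u),
    ENTRY(12u), ENTRY(13u), ENTRY(14u), ENTRY(15u)
};

static void write_mapped(volatile uint32_t *port, uint32_t data)
{
    /* One shift and one mask make the index, one load fetches the pin pattern. */
    uint32_t mapped = lut4[(data >> SRC_SHIFT) & 0xFu];

    /* Read-modify-write: clear only our four pins, then OR in the new pattern. */
    *port = (*port & ~PIN_MASK) | mapped;
}
```

The read-modify-write on the last line is the part that costs an extra port read per update.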

If some ports have more or fewer bits to fill then the lookup tables would be larger or smaller, limited by available ROM (e.g. 8 bits would need 256 * 4 = 1024 bytes). If enough ROM is available you might be able to look up 2 ports at once by combining their index bits and extracting two consecutive 32 bit mapping words (which would need 2048 bytes for 8 bits).
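The two-ports-at-once idea could look roughly like this, scaled down to a 4-bit combined index so the table fits on screen. Everything specific here is an assumption for illustration: port X uses pins 5 and 9, port Y uses pins 2 and 30, and the names PortPair, pair_lut and lookup_pair are invented.

```c
#include <stdint.h>

typedef struct {
    uint32_t x;  /* word for the first port  */
    uint32_t y;  /* word for the second port */
} PortPair;

#define X_BITS(i) ((((i) & 1u) << 5) | (((i) & 2u) << 8))  /* index bits 0,1 -> pins 5, 9  */
#define Y_BITS(i) (((i) & 4u) | (((i) & 8u) << 27))        /* index bits 2,3 -> pins 2, 30 */
#define PAIR(i)   { X_BITS(i), Y_BITS(i) }

/* Both mapping words for one index sit next to each other in memory. */
static const PortPair pair_lut[16] = {
    PAIR(0u),  PAIR(1u),  PAIR(2u),  PAIR(3u),
    PAIR(4u),  PAIR(5u),  PAIR(6u),  PAIR(7u),
    PAIR(8u),  PAIR(9u),  PAIR(10u), PAIR(11u),
    PAIR(12u), PAIR(13u), PAIR(14u), PAIR(15u)
};

static inline PortPair lookup_pair(uint32_t data)
{
    /* One index computation yields the words for both ports. */
    return pair_lut[data & 0xFu];
}
```

With an 8-bit combined index the same layout would have 256 entries of 8 bytes each, which is where the 2048 bytes comes from.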

By using instructions that do several operations at once this should only take a few instructions, certainly fewer than testing each bit and setting its port pin individually. However if a port only has 1 or 2 pins to set it might be quicker to do them individually.

If you can arrange bits in a port to be adjacent then a lookup table won't be needed, which will make the code even faster.

I'll leave it to you to write the code. You could do it in C and then examine the compiled code to see how efficiently it was translated into machine code. If it looks tight enough then it might be OK to use as is, else if you are familiar with ARM machine code you could try optimizing it by hand (and if not, now is the time to learn!). Or there is the fun way, write it in assembly language from scratch!

ETA

For your desired port bit assignments I would try this:-

  • Port A - 5 bits using a lookup table with 32 entries

  • Port B - 3 bits using a lookup table with 8 entries

  • Port C - no lookup table, just map the bits directly to the port with rotate/mask operations. First copy the 32 bit data, rotate it left once and mask with 0x000003FE to map bits 0-8 to port bits 1-9, then rotate another copy left 3 times and mask with 0x000FF000 to map bits 9-16 to port bits 12-19. Finally combine the two masked values with logical OR, and send the result to the port through a mask of 0x000FF3FE.

  • Port D - 7 bits using a lookup table with 128 entries.
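The Port C step can be sketched in C as below. The function name map_port_c is hypothetical, and plain shifts are used in place of rotates, which gives the same result here because the masks discard anything that would have wrapped around.

```c
#include <stdint.h>

/* Map source bits 0-16 onto the 17 usable Port C pins (mask 0x000FF3FE). */
static inline uint32_t map_port_c(uint32_t data)
{
    uint32_t low  = (data << 1) & 0x000003FEu;  /* source bits 0-8  -> port bits 1-9   */
    uint32_t high = (data << 3) & 0x000FF000u;  /* source bits 9-16 -> port bits 12-19 */
    return low | high;
}
```

Source bits 17-31 fall outside both masks, so they never reach Port C and remain available for the lookup tables of the other three ports.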

BTW I see in the AT91SAM3X8E datasheet that PIO output registers can be written to through a mask which is set in another PIO register. If the masks can be set permanently (without affecting operation of other code) then it could shave a few cycles off because you won't need to read the ports.
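To show why that helps, here is a small software model of a port whose output register only updates write-enabled bits. It is a sketch of the assumed hardware behaviour, not driver code: PortModel and its fields are invented names. On the real chip I believe the corresponding registers are PIO_OWER (write enable) and PIO_ODSR (output data), but check the datasheet before relying on that.

```c
#include <stdint.h>

/* Software stand-in for one PIO controller (illustrative names only). */
typedef struct {
    uint32_t write_enable;  /* bits that an output write is allowed to change */
    uint32_t output;        /* current pin levels */
} PortModel;

/* Done once at start-up: choose which pins a plain write may touch. */
static void port_set_write_mask(PortModel *p, uint32_t mask)
{
    p->write_enable = mask;
}

/* Models what the hardware is assumed to do on every output write:
   enabled bits take the new value, all other bits keep their old level. */
static void port_write(PortModel *p, uint32_t value)
{
    p->output = (p->output & ~p->write_enable) | (value & p->write_enable);
}
```

From the caller's side each update is then a single store with no read, AND or OR, which is where the saved cycles would come from.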