Electronic – Cortex M4 memory management suggestions: best data/code placement

arm, cortex-m, lpc, memory, ram

I'm trying to implement a rather complex (at least for me!) system on a Cortex-M4 MCU, the LPC4370. It has a high-speed ADC (up to 80 Msps), DMA, and DSP (Single Instruction Multiple Data) instructions.
Here's what I want to do:

  • Let the ADC sample in a continuous manner (at least at 10 Msps)
  • Move the data to SRAM
  • Process it in real time with the Cortex-M4 DSP instructions (pulse-shaping filter)

The MCU clock is 204 MHz. For now, let's assume the ADC sample rate (fs) is not a fixed design spec, but ideally I would like it to be as high as possible, so the code needs to be as fast as possible.
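To put numbers on "as fast as possible", the per-sample cycle budget is simply the core clock divided by the sample rate. The following is a minimal sketch of that arithmetic; the function names are made up for illustration, and it ignores DMA, interrupt, and memory wait-state overhead.

```c
#include <assert.h>

/* Core cycles available per sample: core clock divided by sample rate. */
double cycles_per_sample(double core_hz, double sample_hz)
{
    assert(sample_hz > 0.0);
    return core_hz / sample_hz;
}

/* Highest sustainable sample rate if the processing loop costs
   `cycles_needed` core cycles per sample (an idealized upper bound). */
double max_sample_rate_hz(double core_hz, double cycles_needed)
{
    assert(cycles_needed > 0.0);
    return core_hz / cycles_needed;
}
```

At 204 MHz, `cycles_per_sample(204e6, 10e6)` gives about 20.4 cycles per sample, and at the ADC's 80 Msps maximum the budget shrinks to about 2.55 cycles per sample.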
Here's the MCU memory set up.

[Image: LPC4370 memory map]
And here is the AHB multilayer matrix:

[Image: AHB multilayer matrix]
For now I am focusing on the general firmware architecture, and specifically on memory management.
Some considerations:

  • I don't want the M4 core and the DMA to fight for memory: the DMA must be able to write data while the M4 is doing the processing
  • Most of the code and all acquired data should be in SRAM for faster execution
  • Instruction fetches should not interfere with data storage (DMA) or processing (M4)

In the LPC4370 user guide (chap.2):

To optimize the CPU performance, the ARM Cortex-M4 has three buses for Instruction
(code) (I) access, Data (D) access, and System (S) access. The I- and D-bus access
memory space is located below 0x2000 0000, the S-bus accesses the memory space
starting from 0x2000 0000. When instructions and data are kept in separate memories,
then code and data accesses can be done in parallel in one cycle. When code and data
are kept in the same memory, then instructions that load or store data may take two
cycles.

My idea at the moment is to hold the sampled data in two different buffers placed in two different memory areas (like LocRam128 and LocRam72) and "ping pong" DMA and M4 between these two areas.
The only problem is that these are the two areas reachable from the instruction (I) bus, so I guess the code should be placed there too, which is not what I want.
I wonder how I could use RAMAHB32 effectively, since it is only connected to the M4 system bus (S) and not to the data (D) or instruction (I) buses.
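The ping-pong idea can be sketched independently of where the two buffers end up. This is a minimal host-runnable illustration, not LPC4370-specific code: `BLOCK_SAMPLES`, `on_dma_block_done` and `take_full_block` are hypothetical names, and on the target the two arrays would be assigned to different RAM banks by the linker script.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SAMPLES 512u   /* hypothetical block size */

/* On the target these would live in two different RAM banks;
   here they are ordinary statics. */
static int16_t ping[BLOCK_SAMPLES];
static int16_t pong[BLOCK_SAMPLES];
static int16_t *const blocks[2] = { ping, pong };

static volatile unsigned fill_idx;  /* block the DMA is filling now      */
static volatile unsigned ready;     /* 1 when the other block is full    */
static volatile unsigned overruns;  /* blocks lost because CPU was late  */

/* Hypothetical hook: call from the DMA "block complete" interrupt. */
void on_dma_block_done(void)
{
    if (ready) {
        overruns++;          /* previous block was never picked up */
    }
    fill_idx ^= 1u;          /* DMA moves on to the other block */
    ready = 1u;
}

/* Main-loop side: returns the full block, or NULL if none is pending. */
int16_t *take_full_block(void)
{
    if (!ready) {
        return NULL;
    }
    ready = 0u;
    return blocks[fill_idx ^ 1u];
}

/* Number of blocks that were overwritten before being processed. */
unsigned overrun_count(void)
{
    return overruns;
}
```

The overrun counter makes a missed deadline visible. A real implementation would also have to guard against the interrupt firing between the two statements at the end of `take_full_block`.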

Any hints?

Best Answer

OK, since you are unable to share more details, I'll give you some general points:

  1. The scatter-gather functionality in the DMA module is going to save your bacon. Take time to understand how it works and how to use it.
  2. If you are worried about memory accesses, just go ahead and place your ping and pong buffers in different memories. Scatter-gather will help facilitate this.
  3. After the above, don't worry about bus contention until you get there. Realistically: if, while using separate memories, bus contention is STILL your bottleneck, then you've got the wrong chip. Pure and simple.
  4. Invest in a SEGGER J-Trace and implement streaming trace on your debug board. This will help you when you need to troubleshoot timing issues. Yes, it's expensive.
  5. Take time to experiment with your processing loop, and size your ping and pong buffers based on the processing loop time. You may also have to get creative with doing partial workloads to meet your deadlines.
  6. I needed to rewrite some of the CMSIS DSP functions to be faster.
  7. Don't be afraid to dig into the CMSIS libraries; they are very readable AND they provide a good example of SIMD processing.
  8. When I was benchmarking my code, I found that leaving my firmware .data section in flash didn't give me a super-significant performance hit over RAM-located .data. That surprised me.
  9. Use fixed-point data everywhere, convert to float only at the end and only if necessary.
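As a rough illustration of points 1 and 2, a scatter-gather chain for two buffers in different memories could look like the sketch below. It assumes the four-word linked-list item layout (source, destination, next item, control) used by the GPDMA. The section names, `BLOCK_WORDS`, the FIFO address argument, and the control bit positions are assumptions to verify against the user manual and your linker script.

```c
#include <assert.h>
#include <stdint.h>

/* The section attribute is only applied on the target. The section
   names are hypothetical and must exist in your linker script. */
#if defined(__arm__)
#define IN_SECTION(name) __attribute__((section(name), aligned(4)))
#else
#define IN_SECTION(name)
#endif

#define BLOCK_WORDS 1024u            /* hypothetical; must fit in 12 bits */

/* Assumed control-word fields; check them against the user manual. */
#define DMA_CTRL_SIZE_MASK 0x0FFFu      /* transfer count              */
#define DMA_CTRL_DST_INC   (1u << 27)   /* increment destination       */
#define DMA_CTRL_TC_IRQ    (1u << 31)   /* interrupt when block done   */

/* One linked-list item: four consecutive words read by the DMA. */
typedef struct dma_lli {
    uintptr_t src;      /* source address, e.g. the ADC FIFO       */
    uintptr_t dst;      /* destination buffer                      */
    uintptr_t next;     /* next item; 0 would end the chain        */
    uint32_t  control;  /* transfer size and options               */
} dma_lli_t;

static uint32_t ping[BLOCK_WORDS] IN_SECTION(".bss.ram_bank_a");
static uint32_t pong[BLOCK_WORDS] IN_SECTION(".bss.ram_bank_b");
static dma_lli_t lli[2];

/* Build a circular two-item chain: fill ping, then pong, then repeat.
   `extra_ctrl` carries width/burst bits chosen by the caller. */
void build_ping_pong_chain(uintptr_t src_fifo, uint32_t extra_ctrl)
{
    uint32_t ctrl = (BLOCK_WORDS & DMA_CTRL_SIZE_MASK)
                  | DMA_CTRL_DST_INC | DMA_CTRL_TC_IRQ | extra_ctrl;

    lli[0] = (dma_lli_t){ src_fifo, (uintptr_t)ping, (uintptr_t)&lli[1], ctrl };
    lli[1] = (dma_lli_t){ src_fifo, (uintptr_t)pong, (uintptr_t)&lli[0], ctrl };
}
```

Because the chain is circular, the DMA keeps alternating between the two buffers without CPU intervention; the CPU only has to react to the per-block interrupt.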

Hope this helps.