Cycle counting with modern CPUs (e.g. ARM)

arm, cortex-m3, microcontroller, programming

In many applications, a CPU whose instruction execution has a known timing relation with expected input stimuli can handle tasks that would require a much faster CPU if the relationship were unknown. For example, in a project I did using a PSoC to generate video, I used code to output one byte of video data every 16 CPU clocks. Since testing whether the SPI device is ready and branching if not would IIRC take 13 clocks, and a load and store to output data would take 11, there was no way to test the device for readiness between bytes; instead, I simply arranged to have the processor execute precisely 16 cycles' worth of code for each byte after the first (I believe I used a real indexed load, a dummy indexed load, and a store). The first SPI write of each line happened before the start of video, and for every subsequent write there was a 16-cycle window where the write could occur without buffer overrun or underrun. The branching loop generated a 13-cycle window of uncertainty, but the predictable 16-cycle execution meant that the uncertainty for all subsequent bytes would fit that same 13-cycle window (which in turn fit within the 16-cycle window of when the write could acceptably occur).

For older CPUs, the instruction timing information was clear, available, and unambiguous. For newer ARMs, timing information seems much more vague. I understand that when code is executing from flash, caching behavior can make things much harder to predict, so I would expect that any cycle-counted code should be executed from RAM. Even when executing code from RAM, though, the specs seem a little vague. Is the use of cycle-counted code still a good idea? If so, what are the best techniques to make it work reliably? To what extent can one safely assume that a chip vendor isn't going to silently slip in a "new improved" chip which shaves a cycle off the execution of certain instructions in certain cases?
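
One empirical check, though it is no substitute for a documented guarantee, is the Cortex-M3's DWT cycle counter, which can at least confirm what a particular loop does on a particular piece of silicon. A minimal sketch in C, with timed_region() standing in for the code under test:

#include <stdint.h>

/* Architecturally defined Cortex-M3 debug registers */
#define DEMCR      (*(volatile uint32_t *)0xE000EDFCu) /* Debug Exception and Monitor Control */
#define DWT_CTRL   (*(volatile uint32_t *)0xE0001000u)
#define DWT_CYCCNT (*(volatile uint32_t *)0xE0001004u)

extern void timed_region(void);   /* placeholder: the loop under test */

uint32_t measure_cycles(void)
{
    uint32_t start, end;
    DEMCR     |= (1u << 24);      /* TRCENA: enable the DWT block */
    DWT_CYCCNT = 0;
    DWT_CTRL  |= 1u;              /* CYCCNTENA: start the cycle counter */
    start = DWT_CYCCNT;
    timed_region();
    end = DWT_CYCCNT;
    return end - start;           /* unsigned subtraction handles wrap-around */
}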

Assuming the following loop starts on a word boundary, how would one determine, based on specifications, precisely how long it would take to execute? (Assume a Cortex-M3 with zero-wait-state memory; nothing else about the system should matter for this example.)

myloop:
  mov r0,r0  ; Short simple instructions to allow more instructions to be prefetched
  mov r0,r0  ; Short simple instructions to allow more instructions to be prefetched
  mov r0,r0  ; Short simple instructions to allow more instructions to be prefetched
  mov r0,r0  ; Short simple instructions to allow more instructions to be prefetched
  mov r0,r0  ; Short simple instructions to allow more instructions to be prefetched
  mov r0,r0  ; Short simple instructions to allow more instructions to be prefetched
  adds r2,r1,#0x12000000 ; 2-word instruction
  ; Repeat the following, possibly with different operands
  ; Will keep adding values until a carry occurs
  it cc
  addscc r2,r2,#0x12000000 ; 2-word instruction, plus extra "word" for the it
  it cc
  addscc r2,r2,#0x12000000 ; 2-word instruction, plus extra "word" for the it
  it cc
  addscc r2,r2,#0x12000000 ; 2-word instruction, plus extra "word" for the it
  it cc
  addscc r2,r2,#0x12000000 ; 2-word instruction, plus extra "word" for the it
;...etc, with more conditional two-word instructions
  subs r8,r8,#1 ; decrement loop counter and set flags for the bpl below
  bpl myloop

During execution of the first six instructions, the core would have time to fetch six words, of which three would be executed, so there could be up to three pre-fetched. The next instructions are all three words each, so it wouldn't be possible for the core to fetch instructions as fast as they are being executed. I would expect that some of the "it" instructions would take a cycle, but I don't know how to predict which ones.

It would be nice if ARM could specify certain conditions under which the "it" instruction timing would be deterministic (e.g. if there are no wait states or code-bus contention, and the preceding two instructions are 16-bit register instructions, etc.) but I haven't seen any such spec.
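
Absent such a spec, one mitigation (my own suggestion, not anything from ARM's documentation) is to cycle-count only over short stretches and re-synchronize each output to a free-running counter, so small uncertainties cannot accumulate. A rough sketch, assuming the DWT cycle counter has been enabled as above, with emit_byte() as a placeholder for the actual output operation:

#include <stdint.h>

#define DWT_CYCCNT (*(volatile uint32_t *)0xE0001004u)  /* free-running cycle counter */

extern void emit_byte(uint8_t b);    /* placeholder for the actual output operation */

void emit_line(const uint8_t *buf, uint32_t count, uint32_t cycles_per_byte)
{
    uint32_t deadline = DWT_CYCCNT;
    for (uint32_t i = 0; i < count; i++) {
        deadline += cycles_per_byte;
        /* Signed difference handles counter wrap; spin until the deadline. */
        while ((int32_t)(DWT_CYCCNT - deadline) < 0)
            ;
        emit_byte(buf[i]);
    }
}

The spin loop itself still has a few cycles of granularity, so this only helps when the acceptable output window is wider than that jitter (as with the 16-cycle SPI window described above).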

Sample application

Suppose one is trying to design a daughterboard for an Atari 2600 to generate component video output at 480p. The 2600 has a 3.579 MHz pixel clock and a 1.19 MHz CPU clock (dot clock / 3). For 480p component video, each line must be output twice, implying a 7.158 MHz dot-clock output. Because the Atari's video chip (the TIA) outputs one of 128 colors using a 3-bit luma signal plus a phase signal with roughly 18 ns resolution, it would be difficult to accurately determine the color just by looking at the outputs. A better approach would be to intercept writes to the color registers, observe the values written, and feed each color register in the TIA a luminance value corresponding to the register number.
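
A sketch of that register-shadowing idea in C; the register indexing, data widths, and RGB format here are illustrative assumptions rather than part of any real design:

#include <stdint.h>

#define NUM_COLOR_REGS 4u                      /* COLUP0, COLUP1, COLUPF, COLUBK */

static uint8_t  shadow_color[NUM_COLOR_REGS];  /* last 7-bit color written to each register */
static uint16_t rgb_palette[128];              /* precomputed TIA color -> output RGB */

/* Called when a write to one of the TIA color registers is seen on the CPU bus. */
void on_color_write(uint8_t reg_index, uint8_t value)
{
    shadow_color[reg_index % NUM_COLOR_REGS] = value >> 1;   /* TIA ignores bit 0 */
}

/* The TIA's luma output is forced to encode the register number, so a captured
 * sample selects the shadow entry, which in turn selects the true color. */
uint16_t translate_sample(uint8_t luma_sample)
{
    return rgb_palette[shadow_color[luma_sample % NUM_COLOR_REGS]];
}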

All this could be done with an FPGA, but some pretty fast ARM devices can be had far cheaper than an FPGA with enough RAM to handle the necessary buffering (yes, I know that for the volumes in which such a thing might be produced, the cost isn't a real factor). Requiring the ARM to watch the incoming clock signal, however, would significantly increase the required CPU speed. Predictable cycle counts could make things cleaner.

A relatively simple design approach would be to have a CPLD watch the CPU and TIA and generate a 13-bit RGB+sync signal, and then have the ARM's DMA grab 16-bit values from one port and write them to another with proper timing. It would be an interesting design challenge, though, to see if a cheap ARM could do everything. DMA could be a useful aspect of an all-in-one approach if its effects on CPU cycle counts could be predicted (especially if the DMA cycles could happen in cycles when the memory bus was otherwise idle), but at some point in the process the ARM would have to perform its table-lookup and bus-watching functions. Note that unlike many video architectures where color registers are written during blanking intervals, the Atari 2600 frequently writes to color registers during the displayed portion of a frame, and many games rely upon pixel-accurate timing.
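
For illustration, a timer-paced port-to-port DMA transfer might be set up roughly as below. The register layout is entirely hypothetical; real channel, trigger, and address assignments are vendor-specific and would come from the chosen part's reference manual.

#include <stdint.h>

/* Hypothetical DMA channel register block -- NOT any specific vendor's layout. */
typedef struct {
    volatile uint32_t SRC;    /* source address */
    volatile uint32_t DST;    /* destination address */
    volatile uint32_t COUNT;  /* number of transfers remaining */
    volatile uint32_t CTRL;   /* enable, transfer width, trigger, circular mode */
} dma_channel_t;

#define DMA_CH0       ((dma_channel_t *)0x40020000u)   /* hypothetical base address */
#define GPIO_IN_DATA  0x40010808u                      /* hypothetical input-port data register */
#define GPIO_OUT_DATA 0x40010C0Cu                      /* hypothetical output-port data register */

#define DMA_CTRL_EN       (1u << 0)
#define DMA_CTRL_16BIT    (1u << 1)
#define DMA_CTRL_TIM_TRIG (1u << 2)   /* one transfer per tick of a 7.158 MHz timer */
#define DMA_CTRL_CIRC     (1u << 3)   /* reload COUNT automatically */

void start_port_copy(uint32_t transfers_per_line)
{
    DMA_CH0->SRC   = GPIO_IN_DATA;
    DMA_CH0->DST   = GPIO_OUT_DATA;
    DMA_CH0->COUNT = transfers_per_line;
    DMA_CH0->CTRL  = DMA_CTRL_EN | DMA_CTRL_16BIT | DMA_CTRL_TIM_TRIG | DMA_CTRL_CIRC;
}

The key question from the cycle-counting standpoint would be whether such transfers steal bus cycles from the CPU deterministically.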

Perhaps the best approach would be to use a couple of discrete-logic chips to identify color writes and force the lower bits of the color registers to the proper values, then use two DMA channels to sample the incoming CPU bus and the TIA output data, and a third DMA channel to generate the output data. The CPU would then be free to process all of the data from both sources for each scan line, perform the necessary translation, and buffer it for output. The only aspect of the adapter's duties that would have to happen in "real time" would be the override of data written to COLUxx, and that could be taken care of using two common logic chips.
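
A sketch of the per-scanline work left to the CPU under that split; the buffer names, sampling arrangement, and helper functions are illustrative assumptions (the 228 color clocks per 2600 scan line is the one real number):

#include <stdbool.h>
#include <stdint.h>

#define SAMPLES_PER_LINE 228u    /* TIA color clocks per 2600 scan line */

extern volatile uint16_t cpu_bus[SAMPLES_PER_LINE];   /* filled by capture DMA channel 1 */
extern volatile uint8_t  tia_out[SAMPLES_PER_LINE];   /* filled by capture DMA channel 2 */
extern uint16_t          line_out[SAMPLES_PER_LINE];  /* consumed by the output DMA channel */

/* Hypothetical helpers standing in for the bus decode and the palette logic sketched earlier. */
extern bool     is_color_write(uint16_t bus_sample, uint8_t *reg, uint8_t *value);
extern void     update_shadow(uint8_t reg, uint8_t value);
extern uint16_t lookup_rgb(uint8_t tia_sample);

void process_line(void)
{
    for (uint32_t i = 0; i < SAMPLES_PER_LINE; i++) {
        uint8_t reg, value;
        /* Apply any color-register write at this point in the line before
         * translating the pixel, since the 2600 changes colors mid-line. */
        if (is_color_write(cpu_bus[i], &reg, &value))
            update_shadow(reg, value);
        line_out[i] = lookup_rgb(tia_out[i]);
    }
}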

Best Answer

I vote for DMA. It's really flexible on the Cortex-M3 and up, and you can do all kinds of crazy things like automatically getting data from one place and outputting it into another at a specified rate or on certain events, without spending ANY CPU cycles. DMA is much more reliable.

But it might be quite hard to understand in detail.

Another option is a soft core on an FPGA, with the timing-critical parts implemented in hardware.