Electronic – Could an ARM (ARM7TDMI) Branch instruction take 6 cycles?

arm, assembly, c

I have found that an ARM branch instruction appears to take 6 cycles to run on an ARM7TDMI processor. This shouldn't happen, because every reference I've found says an ARM7TDMI branch instruction takes only 3 cycles. But:

The C function:

start_time = TC;
for (int i=0; i<120; i++) {
  __asm("NOP");
}
end_time = TC;

The disassembly shows the loop as (update: instruction addresses added):

0x120             MOV R1, 0
0x124             B LOC0
            start:    
0x128             NOP
0x12C             ADD R1, R1, 1
            LOC0:     
0x130             CMP R1, 120
0x134             BLT start

The result shows that the loop takes 1080 cycles (converted from the timer counter value read from TC), i.e., 9 cycles per iteration of the loop kernel. Since NOP, ADD, and CMP are all single-cycle instructions, BLT has to be taking 6 cycles.

At first I suspected that my timing method was flawed. But if I add one NOP to the loop kernel, the time increases by exactly 1 cycle per iteration.

What's wrong here?

(Update: fix: the original disassembly code miswrote ADD R1, R1, 1 as ADD R1, R1)

Update: Answer Accepted: Flash access stall causes the 3 extra cycles

Thanks all for the helpful answers and comments, especially @supercat, @Dzarda, @DaveTweed, @IgorSkochinsky, @WoutervanOoijen. I am running code from flash. The CPU is an LPC23xx. According to the User Manual, it does include a Memory Acceleration Module (MAM) for buffered flash access, and the suggested flash fetch time at my CPU speed is exactly 3 cycles.

The start label in the penalized loop kernel above is aligned to an 8-byte boundary. If I change the alignment of start to a 16-byte boundary, the 3-cycle penalty disappears. This can be explained by the 128-bit (16-byte) flash prefetch buffer of my CPU.

(@WoutervanOoijen) Note that the 3-cycle flash fetch is performed not by the ARM CPU but by the MAM, which prefetches flash data in parallel with the CPU. In my code, with start aligned to an 8-byte boundary, CMP is the first instruction in the 128-bit (4-instruction) MAM prefetch buffer. When the ARM CPU executes BLT, it takes the 1st cycle to "understand" the instruction. Then it tries to fetch the NOP instruction, which is not in the MAM prefetch buffer. That is presumably when the 3 extra cycles occur, while the MAM accesses the flash. Once the NOP instruction is in the buffer (along with the 3 other instructions in its 16-byte flash line), the ARM CPU can refill the pipeline by fetching NOP (5th cycle) and decoding NOP (6th cycle). That's where the total of 6 cycles comes from.

So the answer to my question is Yes, a 6-cycle branch instruction is possible if there's a flash access stall.

Final Unresolved Question

As @WoutervanOoijen points out, the above reasoning has a flaw. The LPC23xx's Memory Acceleration Module has an additional Branch Trail buffer that is supposed to avoid this kind of repeated re-fetch on loop branches. The LPC23xx User Manual states:

The Branch Trail buffer captures the line to which such a non-sequential break occurs. If the same branch is taken again, the next instruction is taken from the Branch Trail buffer

This statement isn't very clear about exactly what is put into the Branch Trail buffer. It could be the last prefetched flash line, or the flash line of the last branch destination. In either case, the flash access penalty shouldn't occur, because the flash line (0x120 ~ 0x12F) containing the branch destination instruction (NOP) should already be in the Branch Trail buffer when BLT is executed (at least from the second time on).

(BTW, I verified the MAM is put in Fully Enabled mode, i.e. MAM_mode_control is 2.)

I will update this question after I find more information about this. And I'll appreciate it if you have any comments on what might be happening here, or what test can be done to look for clues.

Best Answer

Are you running code from RAM or from flash? ARM processors that run code from flash often require wait states in at least some circumstances; such processors often include hardware which can eliminate most of the wait states in common code, but such hardware may be as simple as a single-line buffer which allows an access to the same line of flash as the previous access to avoid the wait state. If the branch target is the last word of a flash line, then the flash would require two or three cycles to fetch that word, and two or three cycles to fetch the following word. If one of the cycles is performed concurrently with some other CPU operation, that would leave a three-cycle penalty.