Electronic – Mysterious hard-fault when I step over

Tags: assembly, cortex-m, cortex-m3, hard-fault

This question was rewritten to remove several updates and improve clarity.

I have a rather obscure Cortex-M3-based MCU and a fairly large C++ project, built in Keil MDK5 with the armcc compiler.

I have this function:

bool CanHandle::sendMsg(CanMsg & msg)
{
    bool result = false; // <--------- problematic line

    CAN_TxMsgTypeDef txMsg;

    if (format == FrameFormat::EXTENDED)
    {
        txMsg.IDE = CAN_ID_EXT;
    }
    else
    {
        txMsg.IDE = CAN_ID_STD;
    }

    ENTER_CRITICAL_SECTION(); 

        uint32_t bufN = CAN_GetEmptyTransferBuffer(set.mdrCan);
        if (bufN != CAN_BUFFER_NUMBER)
        {
            CAN_Transmit(set.mdrCan, bufN, &txMsg);
            isTxQueueFull = false;
        }
        else
        {
            isTxQueueFull = true;
            result = false;
        }

    LEAVE_CRITICAL_SECTION();

    return result;
}

When I compile with -O1, the compiler produces this assembly listing for it:

0x0800068C E92D41FF  PUSH     {r0-r8,lr}
0x08000690 4604      MOV      r4,r0
0x08000692 2700      MOVS     r7,#0x00
0x08000694 7A20      LDRB     r0,[r4,#0x08]
0x08000696 2600      MOVS     r6,#0x00
0x08000698 F04F0801  MOV      r8,#0x01
0x0800069C 2801      CMP      r0,#0x01
0x0800069E D013      BEQ      0x080006C8
0x080006A0 F88D6005  STRB     r6,[sp,#0x05]
0x080006A4 4860      LDR      r0,[pc,#384]  ; @0x08000828
0x080006A6 6800      LDR      r0,[r0,#0x00]
0x080006A8 F3C00508  UBFX     r5,r0,#0,#9
0x080006AC F8D401FC  LDR      r0,[r4,#0x1FC]
0x080006B0 F000FBCA  BL.W     CAN_GetEmptyTransferBuffer (0x08000E48)
0x080006B4 4601      MOV      r1,r0
0x080006B6 2920      CMP      r1,#0x20
0x080006B8 D009      BEQ      0x080006CE
0x080006BA 466A      MOV      r2,sp
0x080006BC F8D401FC  LDR      r0,[r4,#0x1FC]
0x080006C0 F000FBC4  BL.W     CAN_Transmit (0x08000E4C)
0x080006C4 7266      STRB     r6,[r4,#0x09]
0x080006C6 E004      B        0x080006D2
0x080006C8 F88D8005  STRB     r8,[sp,#0x05]
0x080006CC E7EA      B        0x080006A4
0x080006CE F8848009  STRB     r8,[r4,#0x09]
0x080006D2 B004      ADD      sp,sp,#0x10
0x080006D4 4638      MOV      r0,r7
0x080006D6 E8BD81F0  POP      {r4-r8,pc}

And here's the funny thing: when I try to step over the line bool result = false;, I get a hard fault with the UNDEFINSTR flag set. The PC recovered from the stack points to a nonexistent address.

But, and here's the really mysterious thing, if I step over the assembly, step into the C code, or set a breakpoint on line 2 and press run, everything is fine! No hard fault; the program runs on from there.
If I compile with -O0 or make result volatile, the compiler produces different assembly and the hard fault doesn't occur.
I tried different versions of the compiler and the IDE; the problem persists.

Running this program without a debugger produces no fault.

Debugging in the simulator produces no fault either, but stepping over that particular line doesn't actually step over it; the program just runs indefinitely.

This function is called from main, and after it returns there is just a while(1) loop. I believe there is no problem in the outside code.

The function is now reduced to its minimum: if I remove any line, literally any line, the problem goes away. I previously posted several wrong guesses, which I have removed for now. I can't pinpoint any particular instruction or address that produces the hard fault.
All function calls there are dummies that return immediately. The CRITICAL_SECTION macros do not produce critical sections at all, but something has to be on those lines or no fault occurs.

I have literally no idea how that can happen. I don't know exactly how the debugger works, or what the difference is between stepping over a line and setting a breakpoint and hitting "run".

Best Answer

Some ideas that might help when debugging a hard fault on a Cortex-M4; maybe some of them are useful. The address of the instruction that caused the hard fault is pushed onto the stack at offset +0x18 from the stack pointer. If the fault is a synchronous (precise) bus fault, the BFARVALID bit is set; if it is not, it can be forced to be precise by setting the DISDEFWBUF bit in the ACTLR system register.
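The points above can be sketched as a small decoder. This is only an illustration, not code from the project in the question: FaultReport, decodeFault and disableDefaultWriteBuffer are hypothetical names, and the registers are passed in as parameters so the logic can be tried on a PC. The bit positions are the standard Cortex-M3/M4 ones for CFSR and ACTLR, where the write-buffer-disable bit is named DISDEFWBUF in ARM's documentation.

```cpp
#include <cassert>
#include <cstdint>

using std::uint32_t;

// The core stacks r0, r1, r2, r3, r12, lr, pc, xPSR on exception entry,
// so the faulting PC sits at byte offset 0x18 from the stacked SP.
constexpr uint32_t STACKED_PC_WORD = 0x18 / 4;

// Bits of the Configurable Fault Status Register (CFSR).
constexpr uint32_t CFSR_IMPRECISERR = 1u << 10; // asynchronous bus fault
constexpr uint32_t CFSR_BFARVALID   = 1u << 15; // BFAR holds a valid address
constexpr uint32_t CFSR_UNDEFINSTR  = 1u << 16; // undefined instruction

// Bit of the Auxiliary Control Register (ACTLR).
constexpr uint32_t ACTLR_DISDEFWBUF = 1u << 1;  // disable default write buffer

// Hypothetical summary type, introduced only for this sketch.
struct FaultReport {
    uint32_t faultPc;
    bool undefinedInstruction;
    bool bfarValid;
    bool impreciseBusFault;
};

// stackedFrame: pointer to the 8-word frame pushed on exception entry.
// cfsr: value read from the CFSR at the start of the fault handler.
FaultReport decodeFault(const uint32_t* stackedFrame, uint32_t cfsr)
{
    assert(stackedFrame != nullptr);
    FaultReport report;
    report.faultPc = stackedFrame[STACKED_PC_WORD];
    report.undefinedInstruction = (cfsr & CFSR_UNDEFINSTR) != 0;
    report.bfarValid = (cfsr & CFSR_BFARVALID) != 0;
    report.impreciseBusFault = (cfsr & CFSR_IMPRECISERR) != 0;
    return report;
}

// Sets DISDEFWBUF so that buffered writes fault precisely.
// actlr is a parameter here (an assumption of this sketch) instead of
// a hard-coded register address, so the function can run off-target.
void disableDefaultWriteBuffer(volatile uint32_t* actlr)
{
    assert(actlr != nullptr);
    *actlr = *actlr | ACTLR_DISDEFWBUF;
}
```

On the target, the fault handler would pass in MSP or PSP (chosen by bit 2 of the EXC_RETURN value in LR), the CFSR value read from 0xE000ED28, and 0xE000E008 as the ACTLR address. For the UNDEFINSTR case in the question, faultPc is the value to compare against the disassembly.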

Another thing: when executing code from flash, things like the wait-state configuration, caches, and prefetch buffers can sometimes cause errors like this.

I currently have a similar issue, which seems to be influenced by a linker option (ignore_debug_references); I'm not sure how yet...