Electronic – Finding the source of a Hard Fault using extended HardFault_Handler

Tags: atmel, atmel-studio, cortex-m, hard-fault, microcontroller

I have been hitting hard faults in firmware I wrote with FreeRTOS on a SAMD21 (ARM Cortex-M0+) MCU.

To find the cause, I dug further and eventually came across an article from Code_Red that provides the snippet below.
However, at this stage it is not clear to me how to use the values I get once this handler is hit.

I now have a bunch of register values and memory locations, but how can I work out from them which line of code caused the fault?

BTW, the call stack has not been useful: it has only a single entry, which points to the current breakpoint in HardFault_HandlerC().

Thanks in advance for your help,

/**
 * HardFault_HandlerAsm:
 * Alternative Hard Fault handler to help debug the reason for a fault.
 * To use, edit the vector table to reference this function in the HardFault vector
 * This code is suitable for Cortex-M3 and Cortex-M0 cores
 */

// Use the 'naked' attribute so that C stacking is not used.
__attribute__((naked))
void HardFault_HandlerAsm(void){
        /*
         * Get the appropriate stack pointer, depending on our mode,
         * and use it as the parameter to the C handler. This function
         * will never return
         */

        __asm(  ".syntax unified\n"
                        "MOVS   R0, #4  \n"
                        "MOV    R1, LR  \n"
                        "TST    R0, R1  \n"
                        "BEQ    _MSP    \n"
                        "MRS    R0, PSP \n"
                        "B      HardFault_HandlerC      \n"
                "_MSP:  \n"
                        "MRS    R0, MSP \n"
                        "B      HardFault_HandlerC      \n"
                ".syntax divided\n") ;
}

/**
 * HardFault_HandlerC:
 * This is called from HardFault_HandlerAsm with a pointer to the fault stack
 * as the parameter. We can then read the values from the stack and place them
 * into local variables for ease of reading.
 * We then read the various Fault Status and Address Registers to help decode
 * cause of the fault.
 * The function ends with a BKPT instruction to force control back into the debugger
 */
void HardFault_HandlerC(unsigned long *hardfault_args){
        volatile unsigned long stacked_r0 ;
        volatile unsigned long stacked_r1 ;
        volatile unsigned long stacked_r2 ;
        volatile unsigned long stacked_r3 ;
        volatile unsigned long stacked_r12 ;
        volatile unsigned long stacked_lr ;
        volatile unsigned long stacked_pc ;
        volatile unsigned long stacked_psr ;
        volatile unsigned long _CFSR ;
        volatile unsigned long _HFSR ;
        volatile unsigned long _DFSR ;
        volatile unsigned long _AFSR ;
        volatile unsigned long _BFAR ;
        volatile unsigned long _MMAR ;

        stacked_r0 = ((unsigned long)hardfault_args[0]) ;
        stacked_r1 = ((unsigned long)hardfault_args[1]) ;
        stacked_r2 = ((unsigned long)hardfault_args[2]) ;
        stacked_r3 = ((unsigned long)hardfault_args[3]) ;
        stacked_r12 = ((unsigned long)hardfault_args[4]) ;
        stacked_lr = ((unsigned long)hardfault_args[5]) ;
        stacked_pc = ((unsigned long)hardfault_args[6]) ;
        stacked_psr = ((unsigned long)hardfault_args[7]) ;

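        // Note: CFSR, HFSR, AFSR, MMAR and BFAR are defined for ARMv7-M
        // (Cortex-M3/M4) only. ARMv6-M cores (Cortex-M0/M0+) do not
        // implement them, so on those parts only the stacked registers
        // and DFSR carry useful information.

        // Configurable Fault Status Register
        // Consists of MMSR, BFSR and UFSR
        _CFSR = (*((volatile unsigned long *)(0xE000ED28))) ;
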
        // Hard Fault Status Register
        _HFSR = (*((volatile unsigned long *)(0xE000ED2C))) ;

        // Debug Fault Status Register
        _DFSR = (*((volatile unsigned long *)(0xE000ED30))) ;

        // Auxiliary Fault Status Register
        _AFSR = (*((volatile unsigned long *)(0xE000ED3C))) ;

        // Read the Fault Address Registers. These may not contain valid values.
        // Check BFARVALID/MMARVALID to see if they are valid values
        // MemManage Fault Address Register
        _MMAR = (*((volatile unsigned long *)(0xE000ED34))) ;
        // Bus Fault Address Register
        _BFAR = (*((volatile unsigned long *)(0xE000ED38))) ;

        __asm("BKPT #0\n") ; // Break into the debugger

}

Best Answer

So, here's the fun part: it may be impossible to pinpoint exactly which line is throwing the fault. A bug in your code may cause the fault to surface somewhere else entirely, or it may destroy the very state information you would need to trace it, which is super cool. What would really help, though, is to see your entire code base, including the linker scripts and startup code.

In general, though, if you are ending up in hard-fault territory, here are the first things I would check:

  • Faults caused by trying to dynamically allocate memory when there is no heap defined by your linker script. What happens here is that some function calls malloc (or one of its cousins), the library fails because there is not enough space on the heap, and the program crashes. This is a real possibility for you: you are using an RTOS, and most vanilla linker scripts don't allocate any heap space. See this: https://stackoverflow.com/questions/10467244/using-newlibs-malloc-in-an-arm-cortex-m3

  • Faults caused by doing something silly like writing data past the end of an array. This is really easy to do if you use math to generate array indices or use pointers to elements directly. If your boundary checks are buggy, a write to your array may in fact overwrite whatever lies next to it in memory. If that doesn't cause an error directly (e.g. by writing to a read-only or protected location), it may just corrupt your stack. Then you return to a garbage address, probably execute an invalid instruction, and fault.
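
Both checks can be made explicit in code. The following is only a minimal sketch, not taken from your firmware: buf, BUF_LEN, checked_store and fill_scratch are hypothetical names, and it uses the standard library malloc so that it can be tried on a PC. Under FreeRTOS the same NULL check would apply to whatever allocator you actually call, for example pvPortMalloc.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define BUF_LEN 8u

/* Hypothetical fixed-size buffer standing in for any array in the firmware. */
static uint8_t buf[BUF_LEN];

/* Write one byte only if the index is in range.
 * Returns 0 on success and -1 if the write would land outside buf. */
static int checked_store(size_t index, uint8_t value)
{
    if (index >= BUF_LEN) {
        return -1;              /* refuse to corrupt neighbouring memory */
    }
    buf[index] = value;
    return 0;
}

/* Allocate n bytes, fill them, and release them again.
 * Returns -1 instead of dereferencing NULL when the allocation fails. */
static int fill_scratch(size_t n, uint8_t value)
{
    uint8_t *p = malloc(n);
    if (p == NULL) {
        return -1;              /* heap exhausted or not configured */
    }
    memset(p, value, n);
    int ok = (n == 0u || p[n - 1u] == value) ? 0 : -1;
    free(p);
    return ok;
}
```

If either function ever returns -1 on the target, you have found a place where the unchecked version would have corrupted memory or dereferenced a NULL pointer.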

I'd also take a look at this document, "Debugging Hard Fault & Other Exceptions", which is related to your Code Red post. Even though the instructions are for ARM Cortex-M3 and ARM Cortex-M4, the method of interpreting the results is the same.

Using the Register Values

The first register of interest is the program counter. In the code above, the variable pc (stacked_pc in your handler) contains the program counter value. When the fault is a precise fault, the pc holds the address of the instruction that was executing when the hard fault (or other fault) occurred. When the fault is an imprecise fault, then additional steps are required to find the address of the instruction that caused the fault.
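
On Cortex-M3/M4 parts, whether a bus fault was precise or imprecise can be read from the CFSR value captured by the handler. The sketch below assumes the ARMv7-M bit layout of CFSR; the helper names are made up for illustration, and none of this applies to a Cortex-M0/M0+, which has no CFSR.

```c
#include <stdbool.h>
#include <stdint.h>

/* CFSR bit positions as defined for ARMv7-M (Cortex-M3/M4). */
#define CFSR_MMARVALID   (UINT32_C(1) << 7)   /* MMAR holds a valid address  */
#define CFSR_PRECISERR   (UINT32_C(1) << 9)   /* precise data bus error      */
#define CFSR_IMPRECISERR (UINT32_C(1) << 10)  /* imprecise data bus error    */
#define CFSR_BFARVALID   (UINT32_C(1) << 15)  /* BFAR holds a valid address  */

/* True when the stacked PC points at the faulting instruction. */
static bool is_precise_bus_fault(uint32_t cfsr)
{
    return (cfsr & CFSR_PRECISERR) != 0u;
}

/* True when the stacked PC may be several instructions past the culprit. */
static bool is_imprecise_bus_fault(uint32_t cfsr)
{
    return (cfsr & CFSR_IMPRECISERR) != 0u;
}

/* True when the BFAR value read in the handler can be trusted. */
static bool bfar_is_valid(uint32_t cfsr)
{
    return (cfsr & CFSR_BFARVALID) != 0u;
}

/* True when the MMAR value read in the handler can be trusted. */
static bool mmar_is_valid(uint32_t cfsr)
{
    return (cfsr & CFSR_MMARVALID) != 0u;
}
```

For example, a captured CFSR of 0x00008200 has PRECISERR and BFARVALID set, so both the stacked PC and BFAR are meaningful.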

To find the instruction at the address held in the pc variable, either...

  1. Open an assembly code window in the debugger, and manually enter the address to view the assembly instructions at that address, or

  2. Open the break point window in the debugger, and manually define an execution or access break point at that address. With the break point set, restart the application to see which line of code the instruction relates to.

Knowing the instruction that was being executed when the fault occurred allows you to know which other register values are also of interest. For example, if the instruction was using the value of R7 as an address, then the value of R7 needs to be known. Further, examining the assembly code, and the C code that generated the assembly code, will show what R7 actually holds (it might be the value of a variable, for example).
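
As a rough illustration of how the stacked words can be turned into something to look up in the disassembly, here is a small sketch. The type and function names are hypothetical. It assumes the standard eight-word exception frame that your handler already indexes as hardfault_args[0] to hardfault_args[7], and the caller address is only meaningful if LR still held a return address when the fault hit.

```c
#include <stdbool.h>
#include <stdint.h>

/* The eight words the core pushes on exception entry, in stack order. */
typedef struct {
    uint32_t r0, r1, r2, r3, r12, lr, pc, psr;
} fault_frame_t;

/* Copy the stacked words into named fields. */
static fault_frame_t read_fault_frame(const uint32_t *sp)
{
    fault_frame_t f = { sp[0], sp[1], sp[2], sp[3],
                        sp[4], sp[5], sp[6], sp[7] };
    return f;
}

/* Stacked LR with the Thumb bit cleared: the address execution would have
 * returned to, i.e. just after the most recent call (if LR was not reused). */
static uint32_t caller_address(const fault_frame_t *f)
{
    return f->lr & ~UINT32_C(1);
}

/* The low bits of the stacked xPSR hold the active exception number
 * (bits [8:0] on ARMv7-M, bits [5:0] on ARMv6-M). Non-zero means the
 * fault happened inside another exception handler, not in thread code. */
static bool faulted_in_handler(const fault_frame_t *f)
{
    return (f->psr & UINT32_C(0x1FF)) != 0u;
}

/* Bit 24 of xPSR is the Thumb state bit; it should always be set on
 * Cortex-M. A clear bit suggests a jump through a corrupted pointer. */
static bool thumb_bit_set(const fault_frame_t *f)
{
    return ((f->psr >> 24) & 1u) != 0u;
}
```

With the frame decoded this way, the pc field is the address to type into the disassembly window, and caller_address gives a second address to look up when the pc itself is garbage.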

Those are just my top-two off-the-top reasons. If you post your entire code base, we can probably give you more direct help. Good luck!