I'm using PIC controllers, and I have used watchdog timers to trigger a reset in case the software gets stuck somewhere, and also to recover from severe hardware crashes.
I think that resetting the CPU is good, but reporting the crash would also be a very useful aid to debugging. I'm currently looking for a method to do so. I have an EEPROM with free space for this purpose.
- Is there any standard procedure for reporting crashes?
- What parameters shall I monitor / trace in such crash reports?
In addition to Nick's suggestion to use the RCON, which is very useful, here are a few other ideas / pointers that you should keep in mind:
Make sure to check the endurance of the EEPROM if you're always writing data to the same location. Even if it has an endurance of 1,000,000 cycles, if your system can start up in 100 ms it wouldn't take much over a day of constant reboots (due to flaky connections, power problems, etc.) to cause an EEPROM failure. Maybe consider a delay for non-watchdog restarts, or introduce a delay on purpose to avoid that possibility.
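As a rough check of that estimate, assuming an endurance of 10^6 write cycles and exactly one write to the same location per 100 ms boot:

```latex
\[
10^{6}\ \text{writes} \times 0.1\ \tfrac{\text{s}}{\text{write}}
  = 10^{5}\ \text{s} \approx 27.8\ \text{hours}
\]
```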
Another (probably easier / better) way to avoid that problem is to divide a reserved area of EEPROM into fixed-size blocks, use a marker at the start of each block to indicate whether it has been used (non-0xFF value), and simply stop recording errors once the memory is full. Then, after the error log has been read, erase everything so it's good to go for next time.
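The block scheme above could be sketched roughly as follows. This is only an illustration: the record size, slot count, marker value and the `ee_read` / `ee_write` helpers are all assumptions made up for the example, and the EEPROM is faked with a RAM array so the code can be tried on a PC. On a real PIC you would swap the two helpers for whatever EEPROM routines your compiler provides.

```c
#include <stdint.h>

#define LOG_BASE      0u     /* assumed start of the log area in EEPROM */
#define RECORD_SIZE   8u     /* 1 marker byte + 7 payload bytes (assumed layout) */
#define RECORD_COUNT  4u     /* assumed number of slots */
#define MARKER_FREE   0xFFu  /* erased EEPROM reads back as 0xFF */
#define MARKER_USED   0xA5u  /* any non-0xFF value would do */

/* Stand-in for the real EEPROM so the sketch runs on a PC.
   Hypothetical helpers: replace with your compiler's EEPROM access routines. */
static uint8_t fake_eeprom[RECORD_SIZE * RECORD_COUNT];

static uint8_t ee_read(uint16_t addr)             { return fake_eeprom[addr]; }
static void    ee_write(uint16_t addr, uint8_t v) { fake_eeprom[addr] = v; }

/* Erase the whole log area; call this after the log has been read out. */
void log_erase(void)
{
    uint16_t i;
    for (i = 0; i < RECORD_SIZE * RECORD_COUNT; i++)
        ee_write((uint16_t)(LOG_BASE + i), MARKER_FREE);
}

/* Store one record in the first free slot.
   Returns the slot index, or -1 if the log is full (nothing is written). */
int log_append(const uint8_t payload[RECORD_SIZE - 1])
{
    uint16_t slot, i, base;

    for (slot = 0; slot < RECORD_COUNT; slot++) {
        base = (uint16_t)(LOG_BASE + slot * RECORD_SIZE);
        if (ee_read(base) == MARKER_FREE) {
            /* Payload first, marker last: if a reset hits mid-write,
               the slot still looks free instead of half-valid. */
            for (i = 0; i < RECORD_SIZE - 1; i++)
                ee_write((uint16_t)(base + 1 + i), payload[i]);
            ee_write(base, MARKER_USED);
            return (int)slot;
        }
    }
    return -1; /* full: stop recording, so no cell is ever rewritten */
}
```

Because a full log simply rejects new records, each EEPROM cell is written at most once between erases, which sidesteps the endurance problem regardless of how often the device reboots.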
In the case of a watchdog reset the contents of RAM will still be intact, which gives you the possibility of logging the value of some of the more important state variables as part of the error report. You'd need to test / verify this for the particular compiler you're using: many compilers zero the contents of RAM upon startup, so you may need to change that behavior. You'll need to ignore the contents of the memory for other restart types, though.
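One way to decide whether the preserved RAM can be trusted is to guard it with a magic value, roughly as sketched below. The struct fields, the magic constant and the function names are invented for the example. How to exclude the variable from startup zeroing is compiler-specific (XC8, for instance, documents a `__persistent` qualifier; check your own compiler's manual), so the sketch leaves that out, and it takes the "was this a watchdog reset" flag as a parameter rather than reading any particular register.

```c
#include <stdint.h>

#define STATE_MAGIC 0xC0DEu  /* arbitrary pattern, unlikely to appear by chance */

typedef struct {
    uint16_t magic;      /* STATE_MAGIC while the contents are meaningful */
    uint8_t  state;      /* example: current state-machine state */
    uint16_t last_adc;   /* example: last analog reading */
    uint16_t magic_inv;  /* bitwise inverse of magic, as a second check */
} crash_state_t;

/* On target, this variable must be kept out of the startup code's
   RAM clearing (compiler-specific; not shown here). */
static crash_state_t crash_state;

/* Call once during normal startup, after any old contents were logged. */
void crash_state_arm(void)
{
    crash_state.magic     = (uint16_t)STATE_MAGIC;
    crash_state.magic_inv = (uint16_t)~STATE_MAGIC;
}

/* Returns 1 only if the reset was a watchdog reset AND the guard
   words survived; for any other restart type the RAM is ignored. */
int crash_state_valid(int was_watchdog_reset)
{
    return was_watchdog_reset
        && crash_state.magic     == (uint16_t)STATE_MAGIC
        && crash_state.magic_inv == (uint16_t)~STATE_MAGIC;
}
```

At boot you would call `crash_state_valid()` first, copy the fields into the error report if it returns 1, and only then call `crash_state_arm()` for the new run.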
Some sort of timestamp will be valuable in later analysis. Obviously an RTC to give an absolute date/time would be ideal, but if one is not available maybe you could include some sort of tick counter to at least give a relative idea of the time span between failures.
You could always read the values of I/O lines and analog inputs and record them upon reboot, but be aware their values may have changed since the initial condition that caused the crash.
Depending on the code space and performance you have left to spare, if you preserve the contents of a few RAM locations across a watchdog restart you could also consider adding an integer or bitmask to indicate what code / interrupts were called just before the error condition.
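A bitmask of that kind might look like the sketch below. The bit names and helper functions are purely illustrative. Each routine of interest sets its own bit on entry, and the mask is cleared whenever the watchdog is serviced, so after a watchdog reset the surviving value shows what ran since the last healthy point.

```c
#include <stdint.h>

/* Hypothetical breadcrumb bits, one per routine worth tracing. */
#define CRUMB_MAIN_LOOP  0x01u
#define CRUMB_UART_ISR   0x02u
#define CRUMB_TIMER_ISR  0x04u
#define CRUMB_ADC_TASK   0x08u

/* Must live in RAM that survives a watchdog reset (compiler-specific). */
static volatile uint8_t crumbs;

/* Mark that a routine was entered. */
#define CRUMB_SET(bit)  (crumbs |= (uint8_t)(bit))

/* Call at the same place the watchdog is serviced. */
void crumbs_clear(void)
{
    crumbs = 0;
}

/* Read the mask at boot, before anything sets new bits. */
uint8_t crumbs_snapshot(void)
{
    return crumbs;
}
```

For example, a snapshot of `0x0A` after a watchdog reset would mean the UART interrupt and the ADC task both ran after the last watchdog service. Note that `|=` is a read-modify-write, so if bits are set from both interrupts and main code you may need to check that your compiler emits a single bit-set instruction, or protect the update.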