Electronic – Micro-controller failure problem as the temperature rises

failure frequency microcontroller temperature

I'm using an AT91SAM3S8B (Cortex-M3) for my project. The hardware board connects to a PC over USB. The firmware contains some cryptographic algorithms (built from shifts, XORs and memory substitution operations). The algorithm output is correct when the board is first powered on, but it starts to fail as time passes.

The master clock is configured at maximum (64 MHz).

My observations are:

  1. If I lower the MCK frequency, the failure rate decreases. (The decrease is not proportional to the frequency reduction: the failure disappears completely at 55 MHz.)

  2. If I raise the board temperature, the failure happens sooner, and if I lower the temperature enough, the failure disappears again.

  3. Most of the micro-controllers fail, but some do not (even at an artificially raised temperature of 85 °C).

The malfunctioning boards and hardware were sent to Atmel for testing. They conducted a few electrical tests and replied that there is no problem with our board or with their micro-controllers.

Any idea how or where I can track down this issue? Any technical suggestions?

Edit:

More information:

  • The project is built using Keil 4.7.
  • The cryptographic routines are implemented in assembly and C and are linked to the main project as static libraries (there are two separate libraries; the assembly one is built using CodeSourcery).
  • Changing the order of the libraries in the project moves the fault to another algorithm. Changing the name, location or declarations of some functions removes or moves the fault.
  • The hardware board is simple: a microcontroller, a USB port and an SPI flash memory (Winbond serial NAND flash), plus a small LED. The USB port and SPI flash are not used in the simplified test code, which still fails.

Best Answer

@Taheri - This is not an answer, but it would need 6+(?) comments due to space limits and unfortunately comments can't have blank lines for separation of the various points. So please just treat this as a comment giving suggestions, with better formatting and more detail :-)


From experience, we remote readers would need much more detail (some of it likely under NDA) to troubleshoot this problem on your boards efficiently, which makes it difficult to solve via remote help. The suggestion from @BrianDrummond to eliminate your hardware by using an Atmel-approved board is a good one (+1). I have seen similar problems caused by system design mistakes that had nothing to do with the MCU, and Brian's suggestion should help to rule those out.

There are some standard troubleshooting approaches which either haven't been mentioned, or haven't been concluded - here is a brief, non-exhaustive, list:

(a) Simplify the code further; it seems you've done more of this than mentioned in the original question, but yet more can be done to identify exactly where in your algorithm the incorrect result starts. Keep reducing the amount of code and inserting fixed data values until you cannot remove any more code without the problem "disappearing". Then look at the remaining "minimal" code.

What is unusual about that specific "failing" code that could explain why you are seeing incorrect behaviour when most other users of the same chip are not? (If this were easy to trigger, we would likely be flooded with people reporting it.) For example, are you using an unusual peripheral, like the CRC calculation unit, which many other people won't use in their designs? What else are you doing differently that could explain why you are triggering this issue, while thousands of similar MCUs aren't failing in the same way?
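To illustrate the "fixed data values" idea in (a), here is a minimal sketch of a known-answer loop in C. Everything in it is an assumption made for illustration: `run_kat`, `kat_result_t` and `toy_round` are hypothetical names, and `toy_round` is only a stand-in built from shifts and XORs, not your actual algorithm. The golden value should be computed once on a known-good machine (e.g. a PC) and hard-coded.

```c
#include <stdint.h>

/* Hypothetical signature for the routine under test: fixed input in, one word out. */
typedef uint32_t (*kat_fn_t)(uint32_t input);

/* Summary of one known-answer run. */
typedef struct {
    uint32_t iterations_run;   /* how many times the routine was called */
    uint32_t failures;         /* how many outputs differed from the golden value */
    int32_t  first_failure;    /* iteration index of the first mismatch, -1 if none */
    uint32_t first_bad_value;  /* the wrong output seen at that iteration */
} kat_result_t;

/* Call fn(input) repeatedly and compare every result with the golden value. */
kat_result_t run_kat(kat_fn_t fn, uint32_t input, uint32_t golden, uint32_t iterations)
{
    kat_result_t r = {0, 0, -1, 0};
    for (uint32_t i = 0; i < iterations; i++) {
        uint32_t out = fn(input);
        r.iterations_run++;
        if (out != golden) {
            if (r.failures == 0) {
                r.first_failure = (int32_t)i;
                r.first_bad_value = out;
            }
            r.failures++;
        }
    }
    return r;
}

/* Stand-in for one small piece of the real algorithm (shifts and XORs only). */
uint32_t toy_round(uint32_t x)
{
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    return x;
}
```

On the target you would call something like `run_kat(toy_round, 1u, 0x00042021u, 1000000u)` while warming the board, then replace `toy_round` with progressively smaller pieces of the real code. The recorded `first_bad_value` is the raw material for the comparison suggested in (b).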

(b) Don't accept that the data at some point during your algorithm is just wrong (from a code calculation? read from RAM? something else? It's difficult to give more precise suggestions without seeing your code...). Instead, identify exactly how it is wrong by comparing the actual and expected values at each point (bit shift? bit flip? faulty addition result? etc.), and look for consistency (or lack of it) on the same board and across boards. Then focus on finding similarities and differences between the groups of affected and unaffected boards. Among other uses, this specific data can help you see whether your code and the observed behaviour match any errata from Atmel.
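As a rough sketch of what "identify exactly how it is wrong" in (b) could look like in C, the helper below classifies the difference between an expected and an actual 32-bit word. The names (`classify_diff`, `diff_kind_t`) and the chosen categories are hypothetical, and a real investigation would probably want more cases (e.g. byte swaps or off-by-one additions).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical categories for how an observed word differs from the expected one. */
typedef enum {
    DIFF_NONE,        /* values match */
    DIFF_SINGLE_BIT,  /* exactly one bit is flipped */
    DIFF_ROTATED,     /* actual is a bit rotation of expected */
    DIFF_OTHER        /* several bits differ and no rotation explains it */
} diff_kind_t;

/* Count set bits by clearing the lowest set bit until none remain. */
static unsigned popcount32(uint32_t v)
{
    unsigned n = 0;
    while (v) {
        v &= v - 1;
        n++;
    }
    return n;
}

/* Rotate left by r bits (r is reduced modulo 32). */
static uint32_t rotl32(uint32_t v, unsigned r)
{
    r &= 31u;
    return r ? (v << r) | (v >> (32u - r)) : v;
}

/* Classify the mismatch; optionally report which bits differ via flipped_mask. */
diff_kind_t classify_diff(uint32_t expected, uint32_t actual, uint32_t *flipped_mask)
{
    uint32_t mask = expected ^ actual;   /* 1 bits mark positions that differ */
    if (flipped_mask != NULL) {
        *flipped_mask = mask;
    }
    if (mask == 0) {
        return DIFF_NONE;
    }
    if (popcount32(mask) == 1) {
        return DIFF_SINGLE_BIT;
    }
    for (unsigned r = 1; r < 32; r++) {
        if (rotl32(expected, r) == actual) {
            return DIFF_ROTATED;
        }
    }
    return DIFF_OTHER;
}
```

Logging the category and the mask for every failure, per board, makes it much easier to see whether the same bit or the same kind of operation is always involved, which is the consistency check described above.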

(c) What is the history of this project? When did you first notice this behaviour? If previous prototypes were not affected, what is different with those, compared to the "failing" boards you are asking for help with?

(d) Consider other drastic changes e.g. running the code from RAM instead of from Flash (assuming you have a JTAG port to be able to do this more easily). If the same incorrect runtime behaviour is still observed when running from RAM as when running from Flash, then this eliminates a hypothesis of timing problems when reading from Flash (doesn't it?) and that will be one step closer to finding the root cause.

Edit: (e) Have you tried transplanting an MCU from a "failing" board onto a "non-failing" board, to see if the problem moves with the MCU or stays with the PCB and the other components on it?

Hope that helps.