Electronic – Micro-controller failure problem as the temperature rises

failure, frequency, microcontroller, temperature

I'm using an AT91SAM3S8B (Cortex-M3) in my project. The hardware board connects to a PC through a USB port. The firmware contains some cryptographic algorithms (built from shifts, XORs, and memory substitution operations). The algorithm output is correct when the board is first powered on, but it starts to fail as time passes.

The master clock is configured at maximum (64 MHz).

My observations are:

If I lower the MCK frequency, the failure rate decreases. (The decrease is not proportional to the frequency reduction; the failure disappears completely at 55 MHz.)
If I raise the board temperature, the failure happens sooner, and if I lower the temperature enough, the failure disappears again.
Most of the microcontrollers fail, but some do not (even at an artificially raised temperature of 85 °C).

The malfunctioning boards and microcontrollers were sent to Atmel for testing. They conducted a few electrical tests and answered that there is no problem with either our board or their microcontrollers.

Any idea how or where I can track down this issue? Any technical suggestions?

Edit:

More information:

The project is built with Keil 4.7.
The cryptographic routines are implemented in assembly and C and are linked into the main project as static libraries (there are two separate libraries; the assembly one is built with CodeSourcery).
Changing the order of the libraries in the project moves the fault to another algorithm. Changing the name, location, or declarations of some functions removes or moves the fault.
The hardware board is simple: a microcontroller, a USB port, and an SPI flash memory (Winbond serial NAND flash), plus a small LED. The USB port and SPI flash are not used in the simplified test code, which still fails.

Best Answer

@Taheri - This is not an answer, but it would need 6+(?) comments due to space limits and unfortunately comments can't have blank lines for separation of the various points. So please just treat this as a comment giving suggestions, with better formatting and more detail :-)

From experience, we remote readers would need much more detail (some of it likely under NDA) to troubleshoot this problem on your boards efficiently, which makes remote help difficult. The suggestion from @BrianDrummond to rule out your hardware by using an Atmel-approved board is a good one (+1). I have seen similar problems caused by system design mistakes that had nothing to do with the MCU, and Brian's suggestion should help to eliminate those.

There are some standard troubleshooting approaches which either haven't been mentioned, or haven't been concluded - here is a brief, non-exhaustive, list:

(a) Simplify the code further. It seems you've done more of this than mentioned in the original question, but more can still be done to identify exactly where in your algorithm the incorrect result starts. Keep reducing the amount of code and inserting fixed data values until you cannot remove any more code without the problem "disappearing". Then look at the remaining "minimal" code.
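One way to keep the reduction honest is a known-answer loop: feed the routine one fixed input, compare against one fixed expected output, and record when and where the first mismatch appears. The following is only a minimal sketch. The names `kat_run`, `kat_stats`, `KAT_LEN`, the function signature, and the stand-in routine `demo_xor` are all hypothetical and are not taken from the original project.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define KAT_LEN 16u  /* hypothetical block size in bytes */

/* Signature assumed for the routine under test (hypothetical). */
typedef void (*kat_fn)(const uint8_t in[KAT_LEN], uint8_t out[KAT_LEN]);

typedef struct {
    uint32_t runs;            /* iterations executed */
    uint32_t failures;        /* iterations whose output was wrong */
    uint32_t first_fail_run;  /* first failing iteration (valid if failures > 0) */
    size_t   first_fail_byte; /* first differing byte in that iteration */
} kat_stats;

/* Run fn repeatedly on one fixed input and compare with a fixed expected output. */
kat_stats kat_run(kat_fn fn, const uint8_t in[KAT_LEN],
                  const uint8_t expected[KAT_LEN], uint32_t runs)
{
    kat_stats s = {0};
    assert(fn != NULL && in != NULL && expected != NULL);

    for (uint32_t r = 0; r < runs; r++) {
        uint8_t out[KAT_LEN];
        memset(out, 0, sizeof out);  /* never compare against stale data */
        fn(in, out);
        s.runs++;
        if (memcmp(out, expected, KAT_LEN) != 0) {
            if (s.failures == 0) {
                s.first_fail_run = r;
                for (size_t i = 0; i < KAT_LEN; i++) {
                    if (out[i] != expected[i]) { s.first_fail_byte = i; break; }
                }
            }
            s.failures++;
        }
    }
    return s;
}

/* Stand-in for the real routine: XOR every byte with 0x5A. */
void demo_xor(const uint8_t in[KAT_LEN], uint8_t out[KAT_LEN])
{
    for (size_t i = 0; i < KAT_LEN; i++) {
        out[i] = (uint8_t)(in[i] ^ 0x5Au);
    }
}
```

Calling `kat_run(demo_xor, in, expected, 100000)` at each temperature and clock setting would give a failure count and a first-failure position that can be compared as the code is reduced step by step.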

What is unusual about that specific "failing" code that could explain why you are seeing incorrect behaviour when most other users of the same chip are not? If this were easy to trigger, we would likely be flooded with people reporting it. For example, are you using an unusual peripheral, like the CRC calculation unit, which many other people won't use in their designs? What else are you doing differently that could explain why you are triggering this issue while thousands of similar MCUs aren't failing in the same way?

(b) Don't accept that the data at some point during your algorithm is just wrong (from code calculation? or read from RAM? or something else? it's difficult to give more precise suggestions without seeing your code...). Instead identify exactly how it is wrong, by comparing the actual and the expected values at each point (bit shift? bit flip? faulty addition result? etc), and look for consistency (or lack of consistency) on the same board and across boards. Then focus on finding similarities / differences between the groups of similarly affected and similarly unaffected boards. This specific data can help, among other potential uses, to see whether your code and observed behaviour matches with any Errata from Atmel.
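As a rough illustration of "identify exactly how it is wrong", the sketch below classifies one 32-bit mismatch by XORing the expected and actual words. It reports the wrong-bit mask, the number of wrong bits, and whether the actual value is simply the expected value rotated. Checking for a rotation (rather than every kind of shift or arithmetic error) is a simplification chosen here, and `classify_diff`, `diff_report`, and the helper names are hypothetical.

```c
#include <stdint.h>

typedef enum {
    DIFF_NONE,        /* values match */
    DIFF_SINGLE_BIT,  /* exactly one bit differs */
    DIFF_ROTATED,     /* actual is expected rotated left by some count */
    DIFF_OTHER        /* anything else */
} diff_kind;

typedef struct {
    diff_kind kind;
    uint32_t  mask;      /* expected XOR actual: 1-bits mark wrong positions */
    unsigned  bits;      /* number of wrong bits */
    unsigned  rotation;  /* left-rotate count when kind == DIFF_ROTATED */
} diff_report;

/* Count set bits by repeatedly clearing the lowest one. */
static unsigned popcount32(uint32_t v)
{
    unsigned n = 0;
    while (v != 0) { v &= v - 1u; n++; }
    return n;
}

/* Rotate left by k bits (k taken modulo 32). */
static uint32_t rotl32(uint32_t v, unsigned k)
{
    k &= 31u;
    return k == 0 ? v : (v << k) | (v >> (32u - k));
}

diff_report classify_diff(uint32_t expected, uint32_t actual)
{
    diff_report r = { DIFF_OTHER, expected ^ actual, 0, 0 };
    r.bits = popcount32(r.mask);

    if (r.mask == 0) { r.kind = DIFF_NONE; return r; }
    if (r.bits == 1) { r.kind = DIFF_SINGLE_BIT; return r; }

    for (unsigned k = 1; k < 32; k++) {
        if (rotl32(expected, k) == actual) {
            r.kind = DIFF_ROTATED;
            r.rotation = k;
            return r;
        }
    }
    return r;  /* kind stays DIFF_OTHER */
}
```

Logging the `mask` for every failure on every board makes it easier to see whether the same bit positions keep going wrong, which is the kind of consistency that can then be checked against the errata.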

(c) What is the history of this project? When did you first notice this behaviour? If previous prototypes were not affected, what is different with those, compared to the "failing" boards you are asking for help with?

(d) Consider other drastic changes e.g. running the code from RAM instead of from Flash (assuming you have a JTAG port to be able to do this more easily). If the same incorrect runtime behaviour is still observed when running from RAM as when running from Flash, then this eliminates a hypothesis of timing problems when reading from Flash (doesn't it?) and that will be one step closer to finding the root cause.

Edit: (e) Have you tried transplanting an MCU from a "failing" board onto a "non-failing" board, to see if the problem moves with the MCU or stays with the PCB and the other components on it?

Hope that helps.

Related Solutions

Electronic – Building a circuit with LPC1343

A debug LED (You can convert it into a watchdog blinky later to verify that your main loop/1ms interrupt or whatever you're using is still running) is something that I would consider pretty mandatory for an exploratory board. Hello World on your new PCB does not need to be as complex as an LCD. You could repurpose a backlight controlling MOSFET for this purpose if you don't want to add the real components.

I'm assuming you're giving yourself some form of breakout for your extra pins - An LCD screen is great, and I understand the desire to keep it simple, but there's little that can go wrong simply by adding a trace to nowhere, and nowhere can become somewhere someday. Even if you don't want to add real headers, some test points (in the form of staggered rows of .05x.1" copper pads) will let you solder and hot glue some wires on later. This doesn't have to be a big deal. I'd put some jumpers/resistors on those lines, so you can add some 1k resistors to protect your pins from being shorted or hit with ESD if you decide to do so. This also gives you the ability to pull any of your other pins high or low if later you find this is necessary!

One thing that I do on a first board is add a lot of vias. Vias are your friends when making modifications (assuming you're getting this done at a PCB house and don't have to drill them yourself). If you've got two vias on every trace, even if you don't change sides with your trace, you can cut the trace later with an Xacto and run 30-ga wire wrap wire between the traces that need to be swapped (Make sure your vias are big enough for this, though). You can also add 0805 0-ohm jumpers (solder bridges are cheap; you don't need to buy components) and solder wires to the pads later if you don't like the via method. Probably won't be necessary, but it's cheap/free insurance.

Oh, and connect the LCD/USB setup first, then tack wires on temporarily from your working breadboard to make sure that the externals are working.

Electronic – How to Efficiently Decode Non-Standard Serial Signal

Another answer: Stop using interrupts.

People jump to use interrupts too easily. Personally, I rarely use them because they actually waste a lot of time, as you are discovering.

It's often possible to write a main loop which polls everything so rapidly that its latency is within spec, and very little time is wasted.

for (;;)  // main loop, polls forever
{
    if (serial_bit_ready)
    {
        // shift serial bit into a byte
    }

    if (serial_byte_ready)
    {
        // decode serial data
    }

    if (enough_serial_bytes_available)
    {
        // more decoding
    }        

    if (usb_queue_not_empty)
    {
        // handle USB data
    }        
}

Some things in the loop might happen far more often than others, for example the incoming bits. In that case, add more of those tests, so that more of the processor's time is dedicated to that task.

for (;;)  // main loop, polls forever
{
    if (serial_bit_ready)
    {
        // shift serial bit into a byte
    }

    if (serial_byte_ready)
    {
        // decode serial data
    }

    if (serial_bit_ready)
    {
        // shift serial bit into a byte
    }

    if (enough_serial_bytes_available)
    {
        // more decoding
    }        

    if (serial_bit_ready)
    {
        // shift serial bit into a byte
    }

    if (usb_queue_not_empty)
    {
        // handle USB data
    }        
}

There might be some events for which the latency of this approach is too high. For example, you might need a very accurately timed event. In which case, have that event on interrupt, and have everything else in the loop.
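The split between one interrupt-driven event and a polled loop could look roughly like the sketch below. It is written so that it can be exercised on a PC: `timed_event_isr` is a hypothetical handler called by hand rather than by hardware, the counters stand in for real work, and `serial_bit_ready` is passed in as a parameter instead of being read from a peripheral.

```c
#include <stdbool.h>
#include <stdint.h>

/* Shared between the interrupt handler and the main loop. */
static volatile bool     timed_event_pending = false;
static volatile uint32_t timed_event_stamp   = 0;

/* Hypothetical counters standing in for real work. */
static uint32_t bits_handled   = 0;
static uint32_t events_handled = 0;
static uint32_t last_stamp     = 0;

/* Hypothetical ISR: record the timestamp, set a flag, and return quickly. */
void timed_event_isr(uint32_t now)
{
    timed_event_stamp   = now;
    timed_event_pending = true;
}

/* One pass of the main loop. Real firmware would call this inside for (;;). */
void poll_once(bool serial_bit_ready)
{
    if (serial_bit_ready) {
        bits_handled++;               /* shift serial bit into a byte */
    }

    if (timed_event_pending) {
        timed_event_pending = false;  /* clear first so a new event is not lost */
        last_stamp = timed_event_stamp;
        events_handled++;             /* non-urgent follow-up work */
    }
}
```

The handler only captures the time-critical part (here, a timestamp), and everything else stays in the polled loop, so the interrupt adds very little latency to the other checks.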
