Electronic – Fault modeling for embedded systems

faultmicrocontrollersensorwireless

I have a wireless sensor circuit with a microcontroller and a 2.4 GHz transceiver module, some integrated sensors with I²C interface, a UART port and the necessary discrete components.

This board is designed for scavenging power from a solar (PV) panel, with a LiPo battery and a shunt charger. This allows the sensor to be self powered and be operating for an indefinite time, requiring the least possible maintenance.

I'd like to explore the possible faults that can occur in a system like this, and that can be due to aging, violation of environmental specs (temperature, humidity and so on) or wrong maintenance (not design issues/bugs), in order to maximize its operation lifetime.

The environment in which the sensor node operates is a building, sticked to the ceiling or the walls. So extreme temperatures or rain are not considered.

What I came up with are some faults that I try to summarize:

  • Component broken -> open\short circuit
  • Sensor faulty -> wrong output values (but how wrong?)
  • Defecting isolation due to dust\water -> increased leakage
  • Temperature out of range -> ???

How can I estimate how the sensor node is going to fail, and why?

Best Answer

There are far too many degrees-of-freedom to understand "all" the possible faults. There are, however, techniques to identify and mitigate faults early in the design cycle (i.e. before wide release).

Design-time activites (pre-hardware)

Peer review is always a great way to find bugs. Have someone else analyze your design, and be prepared to defend against their questions (or acknowledge that they found a bug, and fix it!) There's no substitute for scrutiny, and fresh eyes often see things that are missed by tired ones. This works for both hardware and software - schematics can be reviewed just as easily as source code.

For the hardware, as others have said, a DFMEA (Design Failure Mode and Effects Analysis) is a good recommendation. For each component, ask yourself "what happens if this shorts out" and "what happens if this goes open-circuit", and make a record of your analysis. For ICs, also imagine what happens if adjacent pins are shorted to each other (solder bridges, etc.)

For the firmware, static code analysis tools (MISRA, lint, etc.) can be used to reveal hidden bugs in the code. Things like floating pointers and equality-instead-of-compare (= vs ==) are common 'oopsies' that these tools will not miss.

A written theory of operation is also very helpful, for both hardware and software. A theory of operation should describe in a fairly high level how the system works, how the protections work, sequencing, etc. Simply putting to words how the logic should flow often leads to one realizing that some cases may have been missed ("Um, waitasec, what about this condition?")

Prototype level testing

Once you get hardware in hand, it's time to get to "work".

After all of the theoretical analysis is done, it is crucial to accurately characterize how the device operates within spec. This is commonly referred to as validation testing or qualification. All of the allowable extremes need to be tested.

Another important qualification activity is component stress analysis. Every part is evaluated against its maximum voltage/current/temperature, in a defined operating condition. In order to ensure robustness, an appropriate derating guideline should be applied (don't exceed 80% of voltage, 70% of power, etc.)

Only once you know how things are under normal conditions can you start to speculate about external abnormals, or multiple abnormals like you're describing. Again, the DFMEA model (what happens if X happens) is a good approach. Think of any possible thing a user could do to the unit - short outputs, tie signals together, spill water on it - try them, and see what happens.

A HALT test (highly accelerated life test) is also useful for these types of systems. The unit is put into an environmental chamber and exercised from minimum to maximum temperature, minimum and maximum input and output, with vibration. This will find all sorts of issues, both electrical and mechanical.

This is also a good time to do some embedded fuzz testing - exercise all of the inputs well beyond their expected ranges, send gibberish in through UARTs / I2C, etc. to find holes in the logic. (Bit-banged I2C routines are notorious for locking up the bus, for instance.)

Strife testing is a good way to demonstrate robustness. Disable any protection features like overtemperature, overload, etc. and apply stress until something breaks. Take the unit up as high in temperature as it can go until something fails or some erratic behaviour occurs. Overload the unit until the powertrain fails. If some parameter fails only slightly above worst-case conditions, its an indication of marginality and some design consideration may have to be revisited.

You can also take the next-level approach and physically test some of your DFMEA conclusions - actually do the shorts and opens and pin-shorts and see what blows up.

Further reading

My background is in power conversion. We have an industry standard called IPC-9592A which is an effort to standardize how products should be qualified in terms of what tests and how they should be done. Many of the types of tests and methodologies referred to by this document could easily be used in other electrical disciplines.