Electronic – How to roughly know if an electronic circuit will fail soon and how to protect against it

durability, failure, reliability, safety, system

I am thinking of a farming project that uses a board such as an Arduino, Raspberry Pi or Onion Mega (the list is not exhaustive).

As the system will work with sensors and keep its wards alive, their health will depend on the whole chain of components working.

I will certainly keep the main board in a safe place, for example in a moisture- and temperature-proof case, and insulate contacts where needed, but I understand that these boards are meant more for education and experiments than for real daily duty. There is also always the chance that a board arrives defective from the manufacturer.

So I wonder whether there is information about how durable these boards are and whether they are suitable for 24/7 operation over weeks or months.

How do I make sure that the system has a more or less definite safety margin, and how do I know when I should replace it with a new one?

Best Answer

You have to look for information about RAMS (reliability, availability, maintainability and safety) engineering.

Basic RAMS concepts and techniques

  • Failure rate: number of expected failures of a component, assembly or product per time unit.
  • MTTF (mean time to failure) / MTBF (mean time between failures): the inverse of the failure rate, assuming the failure rate is constant. It is the expected time your component/assembly/unit will operate under given conditions until a failure happens.
  • ER (established reliability) vs. non-ER components: so-called hi-rel (high reliability) components are often lot-tested to establish their failure rate, which makes them expensive. On the other hand, for non-ER components a rather pessimistic failure rate is assumed according to tabulated values.
  • Parts Count Analysis (PCA) / Parts Stress Analysis (PSA): a method to calculate the expected value for the failure rate of an assembly/product, deriving it from the failure rate of each component and its associated stress (temperature, moisture, power/voltage/current derating, etc.).
  • Derating: operating a component/assembly/product below its maximum power/voltage/current rating, usually expressed as the % of the maximum rating at which it operates. The heavier the derating (i.e., the lower that percentage), the lower the stress and the longer the MTTF.

  • Bathtub curve: a curve describing how the failure rate changes over the useful life of the component/assembly/product. See image below.

  • Burn-in: a non-destructive, initial high-temperature (accelerated aging) test intended for precipitating early failures in already defective components/assemblies/products. It's a kind of screening test.
  • Life test: a destructive, high-temperature (accelerated aging) test intended for establishing the reliability of a whole lot of components/assemblies/products from a reduced sample submitted to this test.
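To make the relation between failure rate, MTTF and the chance of surviving a given period concrete, here is a minimal Python sketch. It assumes a constant failure rate (the flat part of the bathtub curve), and the figure of 10 failures per million hours is a made-up placeholder, not data for any real board.

```python
import math

HOURS_PER_YEAR = 8760


def mttf_hours(failure_rate_per_hour: float) -> float:
    """MTTF is the inverse of a constant failure rate."""
    return 1.0 / failure_rate_per_hour


def survival_probability(failure_rate_per_hour: float, hours: float) -> float:
    """R(t) = exp(-lambda * t): probability of no failure during `hours`."""
    return math.exp(-failure_rate_per_hour * hours)


# Hypothetical failure rate: 10 failures per 10^6 hours (placeholder value).
failure_rate = 10e-6

print(f"MTTF: {mttf_hours(failure_rate):,.0f} h")
print(
    "P(no failure in one year of 24/7 operation): "
    f"{survival_probability(failure_rate, HOURS_PER_YEAR):.3f}"
)
```

Note that an MTTF of N hours does not mean a unit will last N hours: under this model only about 37% of units (exp(-1)) are still running at t = MTTF.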

Bathtub curve

Image source.
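If you want to play with the shape of the bathtub curve numerically, one common way to model it is as the sum of three hazard terms: a decreasing Weibull term (early failures), a constant term (random failures) and an increasing Weibull term (wear-out). The sketch below does that in Python; all shape, scale and rate values are arbitrary assumptions picked only to give the typical shape.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Weibull hazard rate h(t) = (shape / scale) * (t / scale) ** (shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t: float) -> float:
    """Failure rate per hour at age t (hours), built from three assumed terms."""
    early = weibull_hazard(t, shape=0.5, scale=1_000_000.0)  # shape < 1: decreasing
    random_failures = 5e-6                                   # constant rate
    wear_out = weibull_hazard(t, shape=5.0, scale=80_000.0)  # shape > 1: increasing
    return early + random_failures + wear_out


for hours in (10, 100, 1_000, 10_000, 50_000, 100_000):
    print(f"{hours:>7} h: {bathtub_hazard(hours):.2e} failures/h")
```

The printed rates start high, drop to a roughly flat floor, then rise again. A burn-in aims to get past the first region before deployment, while PCA/PSA figures describe the flat region only.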

Where do I begin?

  1. Download MIL-HDBK-217F, RELIABILITY PREDICTION OF ELECTRONIC EQUIPMENT. There you'll find almost all tabulated values you'll need. You don't need to implement all the methods described in it from the beginning, so don't panic about its complexity.
  2. Create an Excel sheet for basic reliability data from your BOM (bill of materials). The columns must include at least the following information about the components: P/N, description and base failure rate. We'll add more information later, if needed.
  3. Populate the Excel sheet with base failure rate data and carry out a basic PCA to calculate your first rough approximation of the failure rate and MTTF of your assembly/product. Don't forget to include the solder joints in the analysis!
  4. Look at the results of your PCA and compare them with the MTTF required by your application:
    • If the PCA delivers an insufficient MTTF, you're already in trouble and should go back to your design, your parts selection or your calculations to check what's wrong with them.
    • If the PCA delivers an MTTF well above your requirement (by a 1000x margin or more), then you might want to stop here. Just check that there aren't any components operating too close to their maximum ratings.
    • If the PCA delivers an MTTF above your requirement, but without a high enough margin, then you'll have to calculate the actual stresses for the components.
  5. If your PCA was inconclusive, then you'll need to carry out a PSA with the actual stresses and environmental conditions (temperature, moisture) of your assembly/product:

    • Go back to your Excel sheet and add more columns to take into account the pi-factors in MIL-HDBK-217F (temperature, quality, environmental, power rating, voltage stress, etc.). Pi-factors are modifiers of the base failure rate according to actual stress conditions.
    • Populate the new fields in your Excel sheet with data from the component datasheets, but also from your own circuit simulation and calculations.
    • Recalculate the modified failure rates for each component according to their pi-factors.
    • Recalculate the total failure rate and MTTF of your assembly/product.
    • Look at the results of your PSA and compare them with the MTTF required by your application. If the results are good, then you're all set. If not, look for the components that contribute the most to the total failure rate and address their problems individually. Does the component need a replacement with a higher power/voltage/current rating? Do certain design values need to change to avoid too much power/voltage/current in the problematic component? Is heatsinking required? And so on.
  6. If you've done everything in your power to reduce the total failure rate but you still can't get an MTTF compatible with your requirement, then you might want to add redundancy to your design, specifically targeted at the subassemblies of your product with high partial failure rates. Redundancy must be introduced only when MTTF calculations demand it, and never in a preemptive manner. Why? Because redundancy requires adding switching elements, which can fail themselves and introduce unneeded complexity as well.

  7. Even if your PCA/PSA says everything will be OK, keep in mind that this holds for random failures only! The PCA/PSA doesn't deal with the early failure rates of defective components/assemblies/products. Therefore, a burn-in of your product is highly recommended before deployment in the field.

  8. If you want to have actual statistical data about the useful life of your assembly/product, you might want to do a life test. But that means spending money on the samples that will be destroyed or worn out during life testing, and having the time (usually around 1,000 hours or more, depending on the testing temperature) and the means to carry it out.
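The spreadsheet from steps 2 to 5 can also be sketched in a few lines of Python. This is only an illustration of the bookkeeping: the part numbers, base failure rates and pi-factor values below are invented placeholders (not taken from MIL-HDBK-217F), and multiplying a base rate by all pi-factors is a simplification, since the handbook defines a specific model per part type. Rates are in failures per 10^6 hours, the unit the handbook uses.

```python
from dataclasses import dataclass, field
from math import prod
from typing import Dict, List


@dataclass
class BomLine:
    part_number: str
    description: str
    quantity: int
    base_failure_rate: float  # failures per 10^6 hours (placeholder values)
    pi_factors: Dict[str, float] = field(default_factory=dict)

    def failure_rate(self, use_pi_factors: bool = False) -> float:
        """Contribution of this BOM line to the total failure rate."""
        rate = self.quantity * self.base_failure_rate
        if use_pi_factors:
            rate *= prod(self.pi_factors.values())  # empty dict gives 1
        return rate


def total_failure_rate(bom: List[BomLine], use_pi_factors: bool = False) -> float:
    """Series model: any single failure fails the assembly, so rates add up."""
    return sum(line.failure_rate(use_pi_factors) for line in bom)


def mttf_hours(rate_per_million_hours: float) -> float:
    return 1e6 / rate_per_million_hours


def top_contributors(bom: List[BomLine], n: int = 3) -> List[BomLine]:
    """Lines that dominate the stressed failure rate, worst first."""
    return sorted(bom, key=lambda line: line.failure_rate(True), reverse=True)[:n]


# Hypothetical BOM; every number here is an assumption for illustration.
bom = [
    BomLine("MCU-01", "microcontroller", 1, 0.05,
            {"temperature": 2.0, "quality": 1.5}),
    BomLine("CAP-10U", "electrolytic capacitor", 4, 0.02,
            {"temperature": 3.0, "voltage_stress": 2.0}),
    BomLine("SOLDER", "solder joint", 200, 0.0001,
            {"environment": 2.0}),
]

pca_rate = total_failure_rate(bom)                       # step 3: base rates only
psa_rate = total_failure_rate(bom, use_pi_factors=True)  # step 5: with stresses

print(f"PCA: {pca_rate:.3f} per 10^6 h, MTTF {mttf_hours(pca_rate):,.0f} h")
print(f"PSA: {psa_rate:.3f} per 10^6 h, MTTF {mttf_hours(psa_rate):,.0f} h")
for line in top_contributors(bom):
    print(f"{line.part_number:<8} {line.failure_rate(True):.3f} per 10^6 h")
```

Sorting by contribution is the useful part in practice: it shows which few components are worth derating, cooling or replacing before you consider anything as drastic as redundancy.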

Notes below:

  1. There are also specialised reliability prediction software packages that will make all these calculations easier for you. Only you can decide whether your application and business case call for such an investment.

  2. Here's a free reliability prediction software package I've found (disclosure: I've never used it).

  3. I've looked for reliability data (MTBF) for Raspberry Pi without any success...