Electrical – FITs in reliability. How to convert number of failures in actual hardware testing to MTBF

reliability

In the following post

What are FITs and how are they used in reliability calculations?

Barry stated "There are standard formulas that convert the number of failures in a given test time to MTBF for a selected confidence level". Can you guide me to these specific formulas, references, etc.? In other words, I want to test actual hardware and derive reliability (MTBF, etc.) from actual tests. I already understand what Barry wrote in his 1.67 million hours MTBF example.

Best Answer

I've never actually derived an MTBF from test data, so take my answer with a grain of salt, but here is what I would do:

You need to build a statistical model that you can use to estimate the MTBF value for the device under test. Building such a model isn't conceptually very hard, and that might be what Barry was referring to in his original post when he talked about "Standard formulas" for calculating MTBF. However, the execution is tedious.


To help with understanding, let's suspend reality for a second and pretend that we have infinite time and money to get to the answer.

Let's say that we have a set of N devices. Each device that is placed into service will eventually fail after some time. If you measure how much time it takes for each device to fail and then calculate the average, this number is the MTBF (actually, it's MTTF, but let's use MTBF in this discussion...the terms are often interchanged) for this particular set of devices.
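For concreteness, here's a minimal sketch of that calculation; the failure times are made up purely for illustration:

```python
# Hypothetical example: time-to-failure (in hours) observed for a set of N devices.
failure_times_hours = [8200.0, 9100.0, 7650.0, 10300.0, 8875.0]

# The MTBF estimate for this set is simply the average time to failure.
mtbf_estimate = sum(failure_times_hours) / len(failure_times_hours)
print(f"MTBF estimate for this set: {mtbf_estimate:.0f} hours")
```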

If you then produce another set of N devices, you can run the same experiment again and calculate another MTBF. This number will likely not be exactly the same as the first one, because the MTBF estimate is a random variable - meaning it is statistical in nature.

So you can consider each MTBF number calculated for each set of devices (or each observation) to be an estimate of the "true" MTBF.

Let's say you do this 100 times, so you have 100 MTBF numbers. Your boss comes to you and says "I have an order from a customer and they want to know what the MTBF will be".

Eventually, someday in the future, the devices that are sent to the customer will fail and there will be actual data that can be used to calculate a real MTBF for that set. Let's call this number MTBF_cust. Your boss and the customer want to know that number now.

You could scream "I can't predict the future!" and storm out, or you can use the MTBF dataset you have developed to make an estimate of MTBF_cust.

The way to make the estimate is to realize that each MTBF estimate is an observation of the random variable that you are trying to measure (i.e. the "true" MTBF), and thus the statistics of the dataset are a good approximation of the statistics of the random variable itself. You can calculate an average for the dataset. Let's call that MTBF_mu. So you can tell your boss that it's probably going to be MTBF_mu.

"Probably? What do you mean probably?".

At this point, you can get into a discussion of how much money you are willing to bet on your answer in order to instill confidence (I once had a colleague who quantified confidence in terms of how many steak dinners one was willing to buy if they were wrong), but we can do better. Since you have a large dataset of MTBF estimates from which you can derive statistics, you can calculate a confidence value for how close MTBF_mu is to the "true" mean. You do this by looking at the spread of your estimates and then selecting a range where you are confident that the measured value (MTBF_cust) will fall. The spread of your data is quantified by calculating a standard deviation, usually denoted sigma, and the range you select for a given confidence value of X% is called an X% confidence interval, denoted CI[X%].

A standard deviation can be thought of as an "average distance from the mean". It's easy to understand what an average is - a standard deviation is a kind of average computed on the dataset after you subtract the mean from each value. In order to remain consistent with the idea of "distance", the resulting values need to be entirely positive. You might think that you should take the absolute value of each number, but that leads to lots of nasty math issues because the absolute value's derivative is discontinuous at zero. So we use the next best thing - we square each value, average the squares, and then take the square root (this is called "root-mean-square", or RMS). It still represents the same idea, although it makes the formula look a little more imposing. Let's call the standard deviation of your MTBF dataset MTBF_sigma.
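Here is a minimal sketch of computing MTBF_mu and MTBF_sigma from a dataset of MTBF estimates (the numbers are invented for illustration; each entry is one set's MTBF estimate, not an individual failure time):

```python
import math

# Hypothetical dataset: one MTBF estimate per set of devices tested (hours).
mtbf_estimates = [9200.0, 8800.0, 9500.0, 9100.0, 8700.0, 9300.0]

# MTBF_mu: the mean of the estimates.
mtbf_mu = sum(mtbf_estimates) / len(mtbf_estimates)

# MTBF_sigma: the RMS of the deviations from the mean.
deviations = [x - mtbf_mu for x in mtbf_estimates]
mtbf_sigma = math.sqrt(sum(d * d for d in deviations) / len(deviations))

print(f"MTBF_mu    = {mtbf_mu:.0f} hours")
print(f"MTBF_sigma = {mtbf_sigma:.0f} hours")
```

(For a small dataset you'd usually divide by `len(deviations) - 1` to get the sample standard deviation, but that detail doesn't change the idea.)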

So, if we have a bunch of MTBF estimates, and they tend to cluster around a given value (MTBF_mu), then the MTBF_sigma represents how much one MTBF estimate is like another. If they all tend to be nearly the same number (they are tightly clustered), then MTBF_sigma will be small and you can say that the MTBF_mu is gonna be really close to the MTBF_cust. If each MTBF is very different from the next, then MTBF_sigma will be large and you should bet fewer steak dinners on your answer.

Now that we have MTBF_mu and MTBF_sigma, we have our statistical model. The reason we can do this is because of something called the central limit theorem, which basically states that if you average enough independent samples of the same random variable, the averages follow a Gaussian distribution - and each of your MTBF estimates is exactly such an average. Gaussian distributions are completely characterized by the mean and standard deviation.
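If you want to see that in action, here is a small simulation sketch. The underlying failure-time distribution is an assumed exponential, chosen only for illustration: drawing many sets of N failure times and averaging each set produces MTBF estimates that pile up in a roughly bell-shaped histogram.

```python
import random

random.seed(0)

TRUE_MTBF = 9000.0   # assumed "true" MTBF in hours, for illustration only
N_DEVICES = 30       # devices per set
N_SETS = 100         # how many sets (i.e. how many MTBF estimates) we collect

mtbf_estimates = []
for _ in range(N_SETS):
    # Assume exponentially distributed failure times (a common, but not universal, model).
    failure_times = [random.expovariate(1.0 / TRUE_MTBF) for _ in range(N_DEVICES)]
    mtbf_estimates.append(sum(failure_times) / N_DEVICES)

# Crude text histogram: the estimates cluster around TRUE_MTBF in a bell-like shape.
for lo in range(5000, 13000, 1000):
    count = sum(1 for m in mtbf_estimates if lo <= m < lo + 1000)
    print(f"{lo:5d}-{lo + 1000:5d} h | {'#' * count}")
```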

So, if you think of MTBF_cust as just another observation of the same thing that you've already observed 100 times, you can realize that MTBF_cust must have the same underlying statistics as all of the previous 100 observations.

You can now define an arbitrary range of potential MTBF values, centered on MTBF_mu, and calculate the probability that MTBF_cust will be in that range. It is a well-established result in statistics that there is a 68% chance that an observation of a Gaussian-distributed random variable will be within 1*sigma of the mean. There is a 95% chance that it will be within 2*sigma of the mean. And so on...see the table on the Wikipedia page for standard deviation. You can make the range arbitrarily large or small and calculate your confidence that MTBF_cust will fall into that range.

The range that corresponds to an X% likelihood that MTBF_cust will be contained in that range is called an "X% confidence interval", e.g. "I am 95% confident that MTBF_cust will be between (MTBF_mu - 2*MTBF_sigma) and (MTBF_mu + 2*MTBF_sigma)".
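A quick sketch of turning MTBF_mu and MTBF_sigma into such an interval (the 1-sigma/2-sigma multipliers are the standard Gaussian coverage values; the numbers themselves are roughly the made-up ones from before):

```python
# Hypothetical values carried over from the earlier sketch (hours).
mtbf_mu = 9100.0
mtbf_sigma = 280.0

# Approximate Gaussian coverage: ~68% within 1 sigma, ~95% within 2 sigma.
for k, confidence in [(1, 68), (2, 95)]:
    low = mtbf_mu - k * mtbf_sigma
    high = mtbf_mu + k * mtbf_sigma
    print(f"~{confidence}% confidence interval: {low:.0f} to {high:.0f} hours")
```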

So now you can tell your boss "how confident do you want to be? Pick any number between, but not including, 0 and 100%."

A typical value chosen is 95%, but depending on your situation you may choose a smaller or larger value.

But MTBFs aren't typically reported as a range - they are given as a single number. Given a particular confidence interval, you could conceivably report the low end, the high end, or any number in between. Which do you choose?

This is somewhat of a judgement call. Clearly, the lower values are more conservative. When reporting MTBF, if you turn out to be wrong, it's probably much more desirable to underestimate the MTBF than to overestimate it (although there is a cost to being wrong in either direction...).

So, let's modify our thinking: instead of choosing a confidence interval centered around the mean, let's choose a confidence interval that contains all values greater than or equal to the MTBF number that we choose to report. This is pretty easy to do. Let's say you want to know the likelihood that MTBF_cust is greater than MTBF_mu - 2*MTBF_sigma. Since we know that +/- 2*MTBF_sigma represents a 95% confidence interval, there is a 5% chance that MTBF_cust falls outside that range. Given that the Gaussian distribution is completely symmetric, that 5% is evenly split between the "tails" on the two sides - so there is a 2.5% chance that MTBF_cust is greater than the high end of the confidence interval, and a 2.5% chance that it is less than the low end. So if we report that the MTBF is MTBF_mu - 2*MTBF_sigma (the low end of the confidence interval), we have only a 2.5% chance of having overestimated MTBF_cust, and a 97.5% chance of having underestimated it. That's a pretty safe value.
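As a sketch, again with the invented numbers from before, the one-sided version of that reasoning looks like:

```python
# Hypothetical values carried over from the earlier sketches (hours).
mtbf_mu = 9100.0
mtbf_sigma = 280.0

# Report the low end of the 2-sigma interval as a conservative MTBF.
reported_mtbf = mtbf_mu - 2 * mtbf_sigma

# The +/- 2-sigma interval covers ~95%, so each tail holds ~2.5%.
print(f"Reported MTBF: {reported_mtbf:.0f} hours")
print("Chance we overestimated MTBF_cust: ~2.5%; chance we underestimated it: ~97.5%")
```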


Ok, so now let's exit theoretical la-la land because we realize that we don't have infinite time and money to gather millions of devices and wait for them to fail so we can build a "real" distribution of the MTBF. That's like selling it and waiting for field failures, but without revenue and with a lot of extra testing cost.

You still need to build a statistical model and come up with estimates of MTBF_mu and MTBF_sigma. To do this in a slightly more realistic (but still expensive and time-consuming) way, you can construct a highly accelerated life test (HALT). This test will likely be some combination of temperature variation, vibration, exposure to gases, liquids, or whatever other environmental factors you expect your devices to experience in their application, all while exercising the device and testing for failure. You may choose to increase the ranges experienced by the devices in order to accelerate the test further.

The model you build will be heavily influenced, of course, by the test conditions used to induce failures. You should choose test conditions based on the environment that you expect the device to be used in. I've seen reliability analyses in which one had to select a "mission profile", with each profile representing a different set of environmental stress factors and thus producing different reliability numbers.

Once you have selected your mission profile(s), you need to design a HALT (one for each profile for which you want an MTBF number) that you believe will induce some failures in your devices within a reasonable amount of time (i.e. an amount of time that you are willing to wait to get your data).

You can then take some number of devices and run them through the test. Your test may call for running for a certain number of hours (test to bogey), or for running until a certain number (perhaps all) of the devices have failed (test to failure), or you may replace devices as they fail and just keep running until you have enough data to be confident in your model (test to confidence). I would prefer the last of these, but time and money may dictate what you can really do. In any case, the idea is to induce failures using the conditions in the HALT and then build a statistical distribution of the failure time.
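For that last step, a common point estimate (assuming a roughly constant failure rate, i.e. exponentially distributed failure times - an assumption on my part, not something required by the approach above) is total accumulated device-hours divided by the number of failures observed:

```python
# Hypothetical HALT result: 20 devices run for 1,000 test hours each, 3 failures observed.
devices_on_test = 20
hours_per_device = 1000.0
failures_observed = 3

# Assuming a roughly constant failure rate (exponential model), the usual point estimate
# is total device-hours divided by the number of failures.
total_device_hours = devices_on_test * hours_per_device
mtbf_point_estimate = total_device_hours / failures_observed
print(f"MTBF point estimate under test conditions: {mtbf_point_estimate:.0f} hours")
```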

HALT data will, of course, provide accelerated failure rates, and you need to find some appropriate way to scale that data into real-world values. This is an area that I am completely unfamiliar with, but some research into HALT testing techniques would likely reveal good methods that have been developed in industry. If you come across good resources, I'd be interested in reading them. From the little bit that I've seen, part of constructing a HALT test involves coming up with multipliers for each acceleration factor (i.e. if you double the duty cycle, you cut the life of the component in half; if you increase the max temperature range by 10%, you reduce the life of the component by 40%; etc.) and then combining these into a single life-reduction factor. You then scale your HALT-derived MTBF by that factor to estimate the MTBF under real operating conditions.
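A minimal sketch of that scaling step, assuming (purely for illustration) that the individual acceleration factors simply multiply together - real factors come from the specific HALT design and the component physics:

```python
# Hypothetical acceleration factors, one per stress applied during the HALT.
# The numbers and the multiply-them-together assumption are illustrative only.
acceleration_factors = {
    "duty_cycle_2x": 2.0,
    "temperature_stress": 1.67,
    "vibration": 1.25,
}

combined_acceleration = 1.0
for factor in acceleration_factors.values():
    combined_acceleration *= factor

mtbf_under_halt = 6700.0  # hours, measured under accelerated conditions (made up)
mtbf_in_field = mtbf_under_halt * combined_acceleration

print(f"Combined acceleration factor: {combined_acceleration:.2f}")
print(f"Estimated field MTBF: {mtbf_in_field:.0f} hours")
```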

Of course, if all of that sounds too daunting and expensive, the easiest route is to perform a reliability analysis based on the expected lifetime of the components of your system. This is commonly done when one needs such an estimate but doesn't want to spend the time and money to measure it. There are companies which maintain databases of reliability data for various component classes (e.g. "Machine Screw, Size 4" fails at rate Y) which can be used in such an analysis. Your analysis should account for the effect of each component's failure on the device's function - for example, if the label fails quickly, you might not want to include that in your final calculation because it may have no effect on function.
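That kind of parts-count analysis is where the FITs from the question title usually show up: for a series system with constant failure rates, component FIT values (failures per 10^9 device-hours) simply add, and the reciprocal of the total gives an MTBF. A minimal sketch, with invented FIT values:

```python
# Hypothetical parts-count calculation. FIT = failures per 1e9 device-hours.
component_fits = {
    "microcontroller": 25.0,
    "voltage_regulator": 10.0,
    "electrolytic_cap": 15.0,
    "connector": 5.0,
}

# For a series system with constant failure rates, component failure rates add.
total_fit = sum(component_fits.values())
total_failure_rate = total_fit / 1e9          # failures per hour
mtbf_hours = 1.0 / total_failure_rate

print(f"Total failure rate: {total_fit:.0f} FIT")
print(f"Predicted MTBF: {mtbf_hours:.2e} hours")
```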


Finally a note on terminology.

MTBF is the mean time between failures. This is the expected amount of time between failures when the device is repaired after each failure.

MTTF is the mean time to failure. This is the expected amount of time until a new device is expected to experience its first failure.

Per the Wikipedia page on MTBF, MTTF typically applies to non-repairable systems, while MTBF applies to systems that are expected to be repaired upon failure and then placed back into service.