SSD – Mean Time Between Failures Explained

drive-failure, ssd

The Mean Time Between Failures, or MTBF, for this SSD is listed as 1,500,000 hours.

That is a lot of hours. 1,500,000 hours is roughly 170 years. Since this particular SSD was invented well after the Civil War, how do they know what the MTBF is?

A couple of options that make sense to me:

  • Newegg just has a typo
  • The definition of mean time between failures is not what I think it is
  • They are using some type of statistical extrapolation to estimate what the MTBF would be

Question:

How is the Mean Time Between Failures (MTBF) obtained for SSDs/HDDs?

Best Answer

Drive manufacturers specify the reliability of their products in terms of two related metrics: the annualized failure rate (AFR), which is the percentage of disk drives in a population that fail during a test, scaled to a per-year estimate; and the mean time to failure (MTTF).

The AFR of a new product is typically estimated based on accelerated life and stress tests or on field data from earlier products. The MTTF is estimated as the number of power-on hours per year divided by the AFR. A common assumption for drives in servers is that they are powered on 100% of the time.

http://www.cs.cmu.edu/~bianca/fast/

An MTTF of 1.5 million hours sounds somewhat plausible.

That would roughly correspond to a test with 1,000 drives running for 6 months in which 3 drives fail.
The AFR would be 3 failures / (1,000 drives * 0.5 years) = 0.6% per year, and the MTTF = 1 year / 0.6% ≈ 1.46 million hours, or about 167 years.
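
As a rough sanity check of that arithmetic (using the same hypothetical test numbers as above), in Python:

    # Hypothetical reliability test: 1,000 drives running for 6 months, 3 failures.
    drives = 1000
    test_years = 0.5
    failures = 3

    drive_years = drives * test_years      # 500 drive-years of operation
    afr = failures / drive_years           # annualized failure rate = 0.006 (0.6%)

    hours_per_year = 365.25 * 24           # assumes drives are powered on 100% of the time
    mttf_hours = hours_per_year / afr      # ~1,461,000 hours

    print(f"AFR:  {afr:.1%}")
    print(f"MTTF: {mttf_hours:,.0f} hours (~{mttf_hours / hours_per_year:.0f} years)")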

A different way to look at that number: if you have 167 drives and leave them running for a year, the manufacturer claims that on average you'll see one drive fail.

But I expect that is simply the constant "random" mechanical/electronic failure rate.

Assuming that failure rates follow the bathtub curve, as mentioned in the comments, the manufacturer's marketing team can massage the reliability numbers a bit, for instance by not counting DOAs (dead on arrival: units that passed quality control but fail when the end-user installs them) and by stretching the DOA definition to also exclude drives in the early-failure spike. And because testing isn't performed for long enough, you won't see ageing effects either.

I think the warranty period is a better indication of how long a manufacturer really expects an SSD to last!
That definitely won't be measured in decades or centuries...


Related to the MTBF is the write endurance: NAND cells can only support a finite number of write cycles. A common metric is the total write capacity, usually expressed in TB. In addition to other performance requirements, that is one big limiter.

To allow a more convenient comparison between different makes and differently sized drives, the write endurance is often converted to a daily write capacity expressed as a fraction of the disk capacity.

Assuming that a drive is rated to live as long as it's under warranty, a 100 GB SSD with a 3-year warranty and a write capacity of 50 TB gives:

        50 TB
---------------------  = 0.46 drive writes per day.
3 * 365 days * 100 GB

The higher that number, the better suited the disk is for write-intensive I/O.
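
A minimal sketch of that conversion in Python, using the hypothetical 100 GB / 50 TB / 3-year numbers from the example above:

    # Hypothetical endurance spec: 100 GB drive, 50 TB total write capacity, 3-year warranty.
    capacity_gb = 100
    write_endurance_gb = 50 * 1000            # 50 TB expressed in GB
    warranty_days = 3 * 365

    # Daily write capacity as a fraction of the drive's own size.
    drive_writes_per_day = write_endurance_gb / (warranty_days * capacity_gb)

    print(f"{drive_writes_per_day:.2f} drive writes per day")   # ~0.46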
At the moment (end of 2014), value server-line SSDs sit at 0.3-0.8 drive writes per day, mid-range drives are steadily increasing from roughly 1 to 5, and the high end seems to skyrocket, with write endurance levels of up to 25 times the drive capacity per day for 3-5 years.

Some real-world tests show that vendor claims can sometimes be massively exceeded, but driving equipment way past the vendor limits isn't always acceptable in an enterprise setting... Instead, buy drives correctly spec'd for your purposes.
