How important is it to burn in a hard drive before you start using it?
If you have good backups and good high-availability systems, not very important, since restoring after a failure should be reasonably painless.
How do you implement a burn-in process?
What software do you use to burn in drives?
How much stress is too much for a burn-in process?
I will typically run badblocks against a drive or new system when I get it, and whenever I resurrect a computer from the spares pile. A command like this:

    badblocks -c 2048 -sw /dev/sde

will write to every block 4 times, each time with a different pattern (0xaa, 0x55, 0xff, 0x00), and read each pass back to verify it (note that -w is destructive, so only do this on a drive with no data you care about). This test does not do anything to exercise lots of random reads/writes, but it should prove that every block can be written to and read.
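Once badblocks finishes, I find it worth a quick SMART sanity check to confirm the drive did not quietly remap sectors during the run. A minimal sketch, assuming smartmontools is installed and the device name matches yours:

    # overall health assessment
    smartctl -H /dev/sde

    # after the write pass, Reallocated_Sector_Ct, Current_Pending_Sector and
    # Offline_Uncorrectable should all still show a raw value of 0
    smartctl -A /dev/sde | grep -E 'Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable'

A non-zero raw value on any of those three right after a burn-in would make me return the drive rather than put it into service.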
You could also run bonnie++ or iometer, which are benchmarking tools; these will stress your drives a bit more. Drives shouldn't fail even if you try to max them out, so you might as well see what they can do. I don't do this myself, but getting an I/O benchmark of your storage system right at install/setup time can be very useful later when you are chasing performance issues.
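If you do want that baseline benchmark, a bonnie++ run is quick to set up. A sketch, assuming the new drive is mounted at /mnt/newdisk; the path, user and size are placeholders to adjust (the file size should be at least twice your RAM so caching doesn't skew the numbers):

    # run against a filesystem mounted on the new drive, as an unprivileged user
    bonnie++ -d /mnt/newdisk -s 16g -u nobody

Keep the output somewhere; having those numbers on file is what makes them useful when a performance question comes up months later.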
How long do you burn in a hard drive?
A single run of badblocks is enough in my opinion, but I believe I have a very strong backup system and my HA needs are not that high. I can afford some downtime to restore service on most of the systems I support. If you are so worried that you think a multi-pass burn-in is required, then you probably should have RAID, good backups, and a good HA setup anyway.
If I am in a rush, I may skip the burn-in entirely; my backups and RAID should cover me.
You can't reliably.
Or rather, you have already done it with the options at your disposal.
As a study at Google found, failing disks do not necessarily show abnormal SMART values. (The converse is more reliable: when they do show abnormal values, they are very likely to fail.)
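That said, SMART is still worth consulting before you condemn the hardware, precisely because a failing prediction is fairly reliable. A sketch of the usual routine, assuming smartmontools is available and /dev/sdb is your drive:

    # kick off the drive's built-in extended self-test
    # (runs in the background, typically a few hours on a large disk)
    smartctl -t long /dev/sdb

    # once it has had time to finish, read back the self-test log
    smartctl -l selftest /dev/sdb

A clean result does not prove the disk is healthy (that is exactly what the Google study showed), but a failed self-test is a very strong indication that it is not.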
Keeping this aside for a moment, bear in mind that even though a lot is standardized in computing, in reality there are bugs in both hardware and software, error margins that can accumulate, and so on. The real world isn't perfect, and it's not unheard of for hard disks not to play nice with particular controllers, and vice versa. Sometimes it's a matter of faulty firmware, sometimes some completely different system component misbehaving, for example a sub-par PSU that chokes on particular load spikes. Or temperature changes, age... the list could be extended almost at will.
So, the standard procedure here would be to put the disk into a significantly different system configuration and re-run the tests. But since you have already done that with the complete change of your system, you have correctly concluded that the disk must be at fault (unless you did not in fact change everything else as you've told us, cable/HBA come to mind, in which case the assumption would not hold).
Edit: I just realized there is one option left: check whether a newer firmware revision is available for this drive model than what's currently on your particular drive. If so, have a look at the change log; it may point to problems like yours.
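If you want to see which firmware revision is currently on the drive before hunting for updates, smartctl reports it; a minimal sketch, device name assumed:

    # the "Firmware Version" line is what you compare against the vendor's download page
    smartctl -i /dev/sdb

Compare that against whatever the manufacturer lists for your exact model, and if there is a newer revision, read its changelog before flashing.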
In conclusion, to establish with complete confidence (in this particular situation!) that the drive is misbehaving, you'll need to send it back to the manufacturer.
On some drives, the raw values should not be read as literal counts; rather, they are a "combo field" packing together various related pieces of information. On these drives, a high hardware ECC recovery value or a high raw error value should not be taken as an indication of imminent failure.
On these drives, rather than the raw values, look at the normalized ones (value/worst/threshold).
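To see both side by side, smartctl prints the raw and normalized figures on the same row; a sketch, with the device path and attribute names as placeholders to adjust for your drive:

    # VALUE/WORST/THRESH are the normalized columns, RAW_VALUE is the combo field
    smartctl -A /dev/sda | grep -E 'Raw_Read_Error_Rate|Seek_Error_Rate|Hardware_ECC_Recovered'

As long as VALUE stays comfortably above THRESH (and WHEN_FAILED stays at "-"), a huge RAW_VALUE on those attributes is nothing to worry about on drives that report them this way.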
Anyway, it's a Seagate drive, right?