ZFS Pool Self-Destructing – Troubleshooting Tips

truenas zfs

Context

I recently noticed my FreeNAS reporting problems with one drive.
It had about 16 bad sectors; I went through the SMART tests, etc.
I bought a new drive of the same capacity and went to install it, and for some reason a power adapter for one of the other drives came partly loose, so I was left with 4 out of 6 drives in the RAID-Z2 array, or basically no redundancy.

The array started resilvering but never completed, and kept telling me there were too many errors (14k+).
I eventually figured out the power adapter problem, since it was unlikely that two drives had actually failed, especially with the second one failing right after I opened the case.
I plugged it back in, but ZFS couldn't do anything with the drive.

I ended up replacing the old drive with itself (it was the same drive, which I confirmed by matching it across gpt, smartctl and zpool, but ZFS somehow couldn't recognize it), and ZFS went back to resilvering.

Of course, the pool still has all the same errors, and now a third drive is resilvering for no apparent reason. I did a few ZFS clears and scrubs, and it's still resilvering all day, every day: it fails, I clear, it resilvers some more, and it's going nowhere.

I'm deeply disappointed in ZFS's inability to recover from this relatively low-risk situation, where in fact only one drive ever failed and was promptly replaced. That said, the NAS and its main and only share are still usable, and I made a backup after the first disk failure anyway.

Question

Is there any way to make ZFS understand that this pool is just fine, that it should simply resilver the two new drives (one of which is an old one that I wiped to help ZFS understand that it could use it), and stop telling me about those errors?

Like a resilver -force -scrub_later -everything_is_obviously_fine -or_i_couldnt_possibly_use_the_share -just_mark_it_all_online -lets_get_back_to_actual_work_now?

Rambling

I'm kind of worried, as right now it claims to be resilvering 3 out of 6 drives in a raidz2 pool that clearly has usable data in it, which I seriously doubt is even possible.

I'm expecting it to bump that up to 4 drives soon, or maybe all 6 why not, recreating all my data out of residual magnetic dust buildup from the air surrounding the hard drives.

Any suggestion is appreciated. Thank you!

Best Answer

I never got an answer, and things got worse before they got better. Overall, after at least a dozen resilverings, scrubs, clears, removal of files that contained errors, and reboots, it ended up back online.

All in all, I think this mostly means that ZFS likes to give big warnings and that the zpool status output is not exactly clear; for one thing, resilvering 3 drives out of 6 in a raidz2 should not have been physically possible.
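
To make the redundancy arithmetic behind that last point concrete, here is a minimal Python sketch. It assumes the simple rule that a raidz vdev with N parity drives stays readable as long as no more than N drives are completely unavailable. The function names are made up for illustration and are not part of any ZFS tool. It also only counts drives with no usable data at all, which may not be what zpool status means when it lists a drive as resilvering.

```python
def remaining_redundancy(parity: int, unavailable: int) -> int:
    """How many more drives could be lost; a negative value means data loss."""
    return parity - unavailable


def vdev_readable(parity: int, unavailable: int) -> bool:
    """True while the number of fully unavailable drives does not exceed parity."""
    return unavailable <= parity


if __name__ == "__main__":
    RAIDZ2_PARITY = 2  # raidz2 keeps two drives' worth of parity
    for missing in range(4):
        print(
            f"{missing} unavailable: "
            f"redundancy left = {remaining_redundancy(RAIDZ2_PARITY, missing)}, "
            f"readable = {vdev_readable(RAIDZ2_PARITY, missing)}"
        )
```

Under that assumption, 2 unavailable drives in my 6-drive raidz2 leaves zero redundancy but a readable pool, while 3 would not be readable at all.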

But mostly, as long as your data is still available and everything looks OK from a share usage standpoint, it will probably end up OK like it did here: just keep rebooting, scrubbing, clearing, and dealing with files that have checksum errors.
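
For reference, the status, clear and scrub part of that loop can be sketched in Python as follows. This is only a sketch under assumptions: the pool name `tank` is a placeholder, the helper names are hypothetical, and by default it only prints the commands instead of running them. The three commands themselves (`zpool status -v`, `zpool clear`, `zpool scrub`) are standard zpool subcommands.

```python
import subprocess


def recovery_commands(pool: str) -> list[list[str]]:
    """Build the command lines for one status / clear / scrub pass."""
    return [
        ["zpool", "status", "-v", pool],  # -v lists files with permanent errors
        ["zpool", "clear", pool],         # reset the pool's error counters
        ["zpool", "scrub", pool],         # re-verify all data against checksums
    ]


def run_cycle(pool: str, dry_run: bool = True) -> None:
    """Print the commands, or actually run them when dry_run is False."""
    for cmd in recovery_commands(pool):
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError if a command fails
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_cycle("tank")  # "tank" is a placeholder pool name
```

Note that a scrub request may be refused while a resilver is still running, so in practice each pass has to wait for the previous resilver to finish.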