Linux – ext4 on faulty disks. How to avoid remount read-only

The Problem:

I'm in charge of a Hadoop cluster of 44 nodes. We have 1.5TB WD Green drives with the (not widely known) Load Cycle Count problem.

These disks work fine, but as they get older they show an increasing number of bad blocks. Rewriting these bad blocks works for some time, but they reappear in different places.

As most of these disks are only used for Hadoop datanodes and we don't have the budget to replace them all, I'm looking for a strategy to:

1. Not go insane maintaining the cluster. Disk errors and related filesystem problems appear almost daily. My current procedure is:
   - stop the Hadoop services, unmount the disks, locate bad blocks using dmesg output and smartctl, and rewrite these bad blocks with hdparm --write-sector.
   - run fsck -f -y on the disk and remount it.
2. Keep the system stable.
   - Hadoop takes care of disk errors (3x redundancy), but I'd rather not risk corrupted filesystems.
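To make the manual procedure above more concrete, here is a minimal sketch of its first step: pulling the failing sector numbers for one disk out of the kernel log. It assumes the kernel reports errors in the usual "I/O error, dev sdX, sector N" form. The function names (bad_sectors, print_rewrite_cmds) and the device name sdb are made up for illustration. The second function only prints the hdparm commands and does not run them, since hdparm --write-sector overwrites the sector with zeros.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the device "sdb" are hypothetical.

# bad_sectors DEV
# Reads kernel log text on stdin and prints the unique sector numbers
# for which an I/O error was reported on DEV (e.g. "sdb"), sorted numerically.
bad_sectors() {
    local dev="$1"
    sed -n "s/.*I\/O error, dev ${dev}, sector \([0-9][0-9]*\).*/\1/p" | sort -n -u
}

# print_rewrite_cmds DEV
# Reads sector numbers on stdin and prints (does NOT execute) the hdparm
# command that would rewrite each sector. Rewriting destroys the sector's data.
print_rewrite_cmds() {
    local dev="$1" sector
    while read -r sector; do
        echo "hdparm --yes-i-know-what-i-am-doing --write-sector ${sector} /dev/${dev}"
    done
}

# Dry run, with the Hadoop services stopped and the disk unmounted:
#   dmesg | bad_sectors sdb | print_rewrite_cmds sdb
```

Reviewing the printed commands before running any of them keeps a typo in the device name from zeroing sectors on the wrong disk.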

What did I do?

At the moment I've changed the mount options to:

errors=continue,noatime, but I still get the occasional read-only remount because of journaling errors.
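Since the read-only remounts still happen now and then, it helps to spot them quickly. The following is a small sketch, not part of my current setup, that lists ext4 filesystems that are currently mounted read-only by parsing text in /proc/mounts format. The function name ro_ext4_mounts is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: the function name is hypothetical.

# ro_ext4_mounts
# Reads /proc/mounts-format text on stdin (device, mount point, fstype,
# options, ...) and prints the mount point of every ext4 filesystem whose
# option list contains "ro".
ro_ext4_mounts() {
    awk '$3 == "ext4" {
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++)
            if (opts[i] == "ro") { print $2; break }
    }'
}

# Usage on a live system:
#   ro_ext4_mounts < /proc/mounts
```

Run from cron on each datanode, the output could feed whatever alerting is already in place.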

Then I've tried disabling the journal:

tune2fs -O ^has_journal. This avoids read-only remounts but seems to corrupt the filesystem (which makes sense without a journal).
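With 44 nodes it is easy to lose track of which disks had the journal removed. As a rough sketch, the "Filesystem features:" line printed by tune2fs -l can be checked for has_journal. The function name has_journal and the device /dev/sdb1 in the usage comment are placeholders for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: the function name and /dev/sdb1 are hypothetical.

# has_journal
# Reads "tune2fs -l" output on stdin. Succeeds (exit 0) if the
# "Filesystem features:" line lists has_journal, fails otherwise.
has_journal() {
    grep '^Filesystem features:' | grep -qw 'has_journal'
}

# Usage on a live system (as root):
#   if tune2fs -l /dev/sdb1 | has_journal; then
#       echo "journal present"
#   else
#       echo "no journal"
#   fi
```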

Now I'm thinking about switching to

tune2fs -o journal_data_writeback and mount with data=writeback,nobh,barrier=0

But I'm not sure if this re-introduces the read-only remounts.
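For clarity, this is roughly the sequence I have in mind for one disk, written as a dry run that only prints the commands. The device /dev/sdb1, the mount point /data/1 and the function name writeback_plan are placeholders. The unmount step is my assumption: as far as I know, ext4 does not allow changing the data mode on a plain remount.

```shell
#!/usr/bin/env bash
# Sketch only: function name, device and mount point are hypothetical.

# writeback_plan DEV MOUNTPOINT
# Prints (does NOT execute) the commands to switch one data disk to
# data=writeback journaling with the mount options mentioned above.
writeback_plan() {
    local dev="$1" mnt="$2"
    echo "umount ${mnt}"
    echo "tune2fs -o journal_data_writeback ${dev}"
    echo "mount -o data=writeback,nobh,barrier=0 ${dev} ${mnt}"
}

# Dry run for one disk:
#   writeback_plan /dev/sdb1 /data/1
```

Note that data=writeback still journals metadata, so this alone would not tell me whether a failed journal write can still trigger the read-only remount.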

So, I'd like to avoid read-only remounts, want to maintain stable filesystem metadata but don't care about errors in the data (Hadoop takes care of this). Speed should also not be impacted.

What choices do I have? I'm aware that this is probably a nightmare story for any sysadmin. The OS partitions are mounted with full journaling, and I'm not going to experiment on production data. This is strictly for the Hadoop datanode / task tracker hard disks.
