Linux – ext4 on faulty disks. How to avoid remount read-only

bad-blocks ext4 hadoop linux

The Problem:

I'm in charge of a Hadoop cluster of 44 nodes. We have 1.5TB WD Green drives with the (not widely known) Load Cycle Count problem.

These disks work fine, but as they get older they show an increasing number of bad blocks. Rewriting these bad blocks works for some time, but they reappear in different places.

As most of these disks are only used for Hadoop datanodes and we don't have the budget to replace them all, I'm looking for a strategy to:

  1. Not go insane maintaining the cluster. Disk errors and related filesystem problems appear almost daily. My current procedure is:

    • Stop Hadoop services, unmount the disk, locate bad blocks using dmesg output and smartctl, and rewrite these bad blocks with hdparm --write-sector.
    • Run fsck -f -y on the disk and remount it.
  2. Keep the system stable.

    • Hadoop takes care of disk errors (3x redundancy), but I'd rather not risk corrupted filesystems.
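The manual procedure from point 1 could be scripted roughly as follows. This is only a sketch under assumptions: the function names, the device names (sdb, sdb1) and the mount point /data/1 are hypothetical, and it assumes the kernel logs failures as "I/O error, dev sdb, sector N". It prints the commands instead of running them, because hdparm --write-sector destroys the contents of the sector it rewrites.

```shell
#!/bin/sh
# Read kernel log text on stdin and print the unique failing sector
# numbers reported for one whole-disk device (e.g. "sdb"), sorted.
# Assumes log lines containing "I/O error, dev <dev>, sector <N>".
bad_sectors_from_log() {
    dev="$1"
    sed -n "s|.*I/O error, dev $dev, sector \([0-9][0-9]*\).*|\1|p" | sort -n -u
}

# Read sector numbers on stdin and PRINT (not execute) one repair pass:
# unmount, rewrite each sector, force a filesystem check, remount.
# Arguments: whole disk, partition, mount point (all hypothetical names).
print_repair_plan() {
    disk="$1"; part="$2"; mnt="$3"
    echo "# stop the Hadoop services using $mnt first"
    echo "umount $mnt"
    while read -r sector; do
        echo "hdparm --yes-i-know-what-i-am-doing --write-sector $sector /dev/$disk"
    done
    echo "fsck -f -y /dev/$part"
    echo "mount $mnt"
}
```

Something like `dmesg | bad_sectors_from_log sdb | print_repair_plan sdb sdb1 /data/1` would then show the plan for one disk so it can be reviewed before anything is run by hand.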

What did I do?

At the moment I've changed the mount options to:

  • errors=continue,noatime, but I still get the occasional read-only remount because of journaling errors.
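For reference, the error behaviour stored in the superblock can be read from tune2fs -l, which prints an "Errors behavior:" line. A small sketch, where the helper name errors_behavior and the partition /dev/sdb1 are hypothetical:

```shell
#!/bin/sh
# Read `tune2fs -l` output on stdin and print the value of the
# "Errors behavior:" field (e.g. "Continue" or "Remount read-only").
errors_behavior() {
    sed -n 's/^Errors behavior:[[:space:]]*//p'
}

# Example use (hypothetical partition name, needs root):
#   tune2fs -l /dev/sdb1 | errors_behavior
# The superblock default can be changed with:
#   tune2fs -e continue /dev/sdb1
```

This only confirms which setting is active. As described above, read-only remounts caused by journal errors still happen with errors=continue in place.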

Then I've tried disabling the journal:

  • tune2fs -O ^has_journal. This avoids read-only remounts but seems to corrupt the filesystem (which makes sense with no journal).

Now I'm thinking about switching to

  • tune2fs -o journal_data_writeback and mount with data=writeback,nobh,barrier=0

But I'm not sure if this re-introduces the read-only remounts.

So, I'd like to avoid read-only remounts and keep the filesystem metadata stable, but I don't care about errors in the data (Hadoop takes care of this). Speed should also not be impacted.

What choices do I have? I'm aware that this is probably a nightmare story for any sysadmin. OS partitions are mounted with full journaling and I'm not going to test around on production data. This is strictly for Hadoop data nodes / task tracker hard disks.

Best Answer

The best thing you can do is get the disks replaced. The cost of new disks is small compared with the cost of the cluster being down and the working time you put into fixing the bad blocks. So even without a budget, I would seriously try to convince your management.