HDFS datanode startup fails when disks are full

hadoop hdfs

Our HDFS cluster is only 90% full overall, but some datanodes have individual disks that are 100% full. As a result, when we mass-reboot the entire cluster, some datanodes fail to start at all, with a message like this:

2013-10-26 03:58:27,295 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Mkdirs failed to create /mnt/local/sda1/hadoop/dfsdata/blocksBeingWritten

With three-way replication, only three datanodes have to fail this way before we start experiencing real data loss.

Currently we work around it by decreasing the amount of space the filesystem reserves for the root user, but we will eventually run out of headroom there. We also run the rebalancer pretty much constantly, but some disks stay stuck at 100% anyway.
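
Concretely, the workaround amounts to something like the following; the device name and balancer threshold are only illustrative, so adjust them for your disk layout and Hadoop version:

# Lower the ext filesystem's root-reserved blocks from the default 5% to 1%
# on one of the full data disks (illustrative device name).
tune2fs -m 1 /dev/sda1

# Run the balancer so over-full datanodes shed blocks; the threshold is the
# allowed deviation, in percent, from the cluster-average utilisation.
hadoop balancer -threshold 5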

Changing the dfs.datanode.failed.volumes.tolerated setting is not the solution, since the volumes have not actually failed; they are just full.

Any ideas?

Best Answer

According to the HDFS default configuration (hdfs-default.xml), dfs.datanode.du.reserved is applied per volume, not per datanode. So if you set it to, say, 10 GB and your datanode has 4 volumes configured for HDFS, it will set aside 10 GB on each volume, 40 GB in total, for non-DFS use, which keeps HDFS from filling any individual disk to 100% in the first place.
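
For reference, the property goes in hdfs-site.xml and takes a value in bytes; the 10 GB figure below is just an example, not a recommendation:

<property>
  <!-- Space in bytes reserved per volume for non-DFS use (10 GB here). -->
  <name>dfs.datanode.du.reserved</name>
  <value>10737418240</value>
</property>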