HDFS datanode startup fails when disks are full


Our HDFS cluster is only 90% full, but some datanodes have disks that are 100% full. As a result, when we reboot the entire cluster at once, some datanodes fail to start at all, with a message like this:

2013-10-26 03:58:27,295 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Mkdirs failed to create /mnt/local/sda1/hadoop/dfsdata/blocksBeingWritten

Only three have to fail this way before we start experiencing real data loss.

Currently we work around it by decreasing the amount of space reserved for the root user, but we'll eventually run out. We also run the rebalancer pretty much constantly, but some disks stay stuck at 100% anyway.

Changing the dfs.datanode.failed.volumes.tolerated setting is not the solution, as the volume has not failed.

Any ideas?

Best Answer

According to the HDFS default configuration (hdfs-default.xml), dfs.datanode.du.reserved applies per volume. So if you set it to, say, 10 GB and your datanode has 4 volumes configured for HDFS, it will set aside 40 GB in total for non-DFS use.
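
To make the arithmetic concrete, here is a minimal Python sketch. It assumes the value of dfs.datanode.du.reserved is given in bytes and that "10 GB" means 10 GiB (1024^3 bytes). The helper name reserved_bytes is hypothetical and only illustrates the calculation.

```python
from typing import Tuple

GIB = 1024 ** 3  # assumption: treat 1 GB as 1 GiB when converting to bytes


def reserved_bytes(per_volume_gib: int, volume_count: int) -> Tuple[int, int]:
    """Return (per-volume reservation in bytes, total reserved on the datanode)."""
    per_volume = per_volume_gib * GIB
    # The setting applies to each volume separately, so the node-wide
    # reservation is the per-volume value times the number of volumes.
    return per_volume, per_volume * volume_count


per_volume, total = reserved_bytes(per_volume_gib=10, volume_count=4)
print(f"dfs.datanode.du.reserved = {per_volume}")       # value for hdfs-site.xml
print(f"total reserved on node   = {total // GIB} GiB")  # across all 4 volumes
```

With 10 GiB per volume and 4 volumes, this gives 10737418240 for the property and 40 GiB reserved on the node overall.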