Hadoop: How to configure failover time for a datanode

failover, hadoop, hdfs

I need to re-replicate blocks on my HDFS cluster when a datanode fails. This already appears to happen after a period of roughly 10 minutes. However, I want to decrease this time, and I am not sure how to do so.

I tried setting dfs.namenode.check.stale.datanode, but without much success. So which configuration options do I have to adjust to decrease this to, say, one minute?

The relevant section of my hdfs-site.xml looks like this:

<property>
    <name>dfs.namenode.check.stale.datanode</name>
    <value>true</value>
    <description>Activate stale check</description>
</property>

<property>
    <name>dfs.namenode.stale.datanode.interval</name>
    <value>10</value>
    <description>Stale datanode interval (milliseconds)</description>
</property>
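
As an aside, dfs.namenode.stale.datanode.interval is specified in milliseconds (its default is 30000, i.e. 30 s), so a value of 10 is far too small to be meaningful. A one-minute stale interval would look like this (a sketch; the value is illustrative):

<property>
    <name>dfs.namenode.stale.datanode.interval</name>
    <!-- interval is in milliseconds; 60000 ms = 1 minute -->
    <value>60000</value>
</property>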

Best Answer

Based on a discussion on the hadoop-user mailing list, it appears that dfs.namenode.heartbeat.recheck-interval needs to be set in hdfs-site.xml. The time until a datanode is marked dead is calculated from this value in combination with dfs.heartbeat.interval. In fact, the configuration

<property>
    <name>dfs.namenode.heartbeat.recheck-interval</name>
    <value>10000</value>
</property>

resulted in roughly 45 s until the node was marked dead. (This applies to Hadoop 2.6.)
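For reference, in Hadoop 2.x the dead-node timeout is computed as 2 × dfs.namenode.heartbeat.recheck-interval (milliseconds) + 10 × 1000 × dfs.heartbeat.interval (seconds). With the recheck interval of 10000 ms above and the default 3 s heartbeat, that gives 2 × 10000 + 10 × 3000 = 50000 ms ≈ 50 s, consistent with the observed ~45 s. To land at roughly one minute, a sketch (assuming the default 3 s heartbeat and this formula) would be:

<property>
    <name>dfs.namenode.heartbeat.recheck-interval</name>
    <!-- 2 * 15000 ms + 10 * 1000 * 3 s = 60000 ms = 1 minute -->
    <value>15000</value>
</property>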
