Hadoop disk fails, what do you do?

failover, hadoop, hard-drive, hardware

I would like to know your strategies for what to do when a disk in one of the Hadoop servers fails.

Let's say I have multiple (>15) Hadoop servers and one namenode, and one of the six disks on a slave stops working (the disks are connected via SAS). I don't care about retrieving data from this disk; I'm asking about general strategies for keeping the cluster running.

What do you do?

Best Answer

We have deployed Hadoop. HDFS lets you set a replication factor for files, i.e. how many times each block gets replicated. Hadoop's single point of failure is the namenode. If you are worried about disks going out, increase the replication factor to 3 or more.
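
For concreteness, a small sketch of how that looks from the command line, assuming a Hadoop 2.x or newer CLI and an example path; the cluster-wide default lives in the dfs.replication property in hdfs-site.xml:

    # Default replication for new files comes from dfs.replication in
    # hdfs-site.xml; check what the cluster is currently using:
    hdfs getconf -confKey dfs.replication

    # Raise replication on data that already exists and wait for it to finish
    # (/user/data is just an example path)
    hadoop fs -setrep -w 3 /user/data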

Then if a disk goes bad, it's very simple: throw it out, replace it, and reformat. Hadoop will adjust automatically. In fact, as soon as a disk goes out it will start re-replicating blocks to maintain the replication factor.
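
A rough sketch of the follow-up checks after swapping the disk, again assuming the Hadoop 2.x `hdfs` CLI (older clusters expose the same commands as `hadoop dfsadmin` / `hadoop fsck`); the mount point below is an example and should match whatever your dfs.datanode.data.dir lists:

    # Mount the reformatted disk back at the directory the DataNode expects,
    # e.g. /data/disk3 from dfs.datanode.data.dir, restart the DataNode,
    # then confirm it rejoined the cluster:
    hdfs dfsadmin -report

    # Watch the under-replicated block count drain back to zero as HDFS
    # re-replicates whatever lived on the dead disk:
    hdfs fsck / | grep -i 'under replicated'

If you would rather the DataNode keep running with one dead volume instead of shutting itself down, dfs.datanode.failed.volumes.tolerated can be raised from its default of 0.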

I am not sure why you offered such a large bounty. You said you don't care about retrieving the data, and Hadoop's only single point of failure is the namenode; all other nodes are expendable.
