Hadoop cluster. 2 Fast, 4 Medium, 8 slower machines

cluster hadoop hardware

We're going to purchase some new hardware to use just for a Hadoop cluster, and we're stuck on what to buy. Say we have a budget of $5k: should we buy two super nice machines at $2,500 each, four at around $1,200 each, or eight at around $600 each? Will Hadoop work better with more, slower machines or with fewer, much faster machines? Or, as with most things, is the answer "it depends"? 🙂

Best Answer

If you can, I would look at using cloud infrastructure services like Amazon Web Services (AWS) Elastic Compute Cloud (EC2), at least until you determine that it makes sense to invest in your own hardware. It's easy to get caught up in buying the shiny gear (I have to resist daily). By trying before you buy in the cloud, you can learn a lot and answer the question: does my company's software X or map/reduce framework, run against this data set, best match a small, medium, or large set of servers? I ran a number of combinations on AWS, scaling up, down, in, and out, for pennies on the dollar within a few days. We were so happy with our testing that we decided to stay with AWS and forgo buying a large cluster of machines that we would have to cool, power, maintain, et cetera. The available instance types are:

Standard Instances

  • Small Instance (Default) 1.7 GB of memory, 1 EC2 Compute Unit (1 virtual core with 1 EC2 Compute Unit), 160 GB of instance storage, 32-bit platform
  • Large Instance 7.5 GB of memory, 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each), 850 GB of instance storage, 64-bit platform
  • Extra Large Instance 15 GB of memory, 8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each), 1690 GB of instance storage, 64-bit platform

High-CPU Instances

  • High-CPU Medium Instance 1.7 GB of memory, 5 EC2 Compute Units (2 virtual cores with 2.5 EC2 Compute Units each), 350 GB of instance storage, 32-bit platform
  • High-CPU Extra Large Instance 7 GB of memory, 20 EC2 Compute Units (8 virtual cores with 2.5 EC2 Compute Units each), 1690 GB of instance storage, 64-bit platform

EC2 Compute Unit (ECU) – One EC2 Compute Unit (ECU) provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor.

Standard On-Demand Instances    Linux/UNIX Usage    Windows Usage
Small (Default)                 $0.10 per hour      $0.125 per hour
Large                           $0.40 per hour      $0.50 per hour
Extra Large                     $0.80 per hour      $1.00 per hour

High-CPU On-Demand Instances    Linux/UNIX Usage    Windows Usage
Medium                          $0.20 per hour      $0.30 per hour
Extra Large                     $0.80 per hour      $1.20 per hour
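To compare these options against the $5k budget from the question, a rough sketch like the one below may help. It only uses the compute units and Linux/UNIX on-demand prices listed above; the function names and the 8-node cluster size are illustrative assumptions, and it ignores storage, data transfer, and any other charges, so treat the numbers as a starting point rather than a real cost model.

```python
# Specs and Linux/UNIX on-demand prices copied from the lists above.
INSTANCES = {
    "Small": {"ecu": 1, "usd_per_hour": 0.10},
    "Large": {"ecu": 4, "usd_per_hour": 0.40},
    "Extra Large": {"ecu": 8, "usd_per_hour": 0.80},
    "High-CPU Medium": {"ecu": 5, "usd_per_hour": 0.20},
    "High-CPU Extra Large": {"ecu": 20, "usd_per_hour": 0.80},
}

BUDGET_USD = 5000  # the hardware budget mentioned in the question


def usd_per_ecu_hour(spec):
    """Hourly price divided by the instance's EC2 Compute Units."""
    return spec["usd_per_hour"] / spec["ecu"]


def cluster_hours(spec, nodes, budget=BUDGET_USD):
    """Hours a cluster of `nodes` identical instances can run on `budget`.

    Assumption: compute is the only cost (no storage or transfer fees).
    """
    return budget / (spec["usd_per_hour"] * nodes)


if __name__ == "__main__":
    nodes = 8  # hypothetical cluster size, matching the "eight machines" option
    for name, spec in INSTANCES.items():
        print(
            f"{name:22s} ${usd_per_ecu_hour(spec):.3f} per ECU-hour, "
            f"{nodes} nodes for about {cluster_hours(spec, nodes):,.0f} hours"
        )
```

On these list prices the standard instances all work out to $0.10 per ECU-hour and the high-CPU instances to $0.04 per ECU-hour, so the more interesting question for a Hadoop job is usually whether it is bound by CPU, memory, or disk, which is exactly what a few test runs would show.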

Sorry to make an answer sound like a vendor pitch, but if your environment allows you to go this route, I think you'll be happy and make a much better purchase decision should you buy your own hardware in the future.