Java vs. C++ for Big Data – Why is the Hadoop Ecosystem Written in Java?

Tags: big-data, c++, hadoop, java

When developing Big Data processing pipelines and storage, you probably come across software that is more or less part of the Hadoop ecosystem, be it Hadoop itself, Spark, Flink, HBase, Kafka, Accumulo, etc.

Now, all of these are very well implemented, offering fast, high-quality solutions to the developer's needs. Still, especially given Big Data usage patterns, a huge number of object allocations and deallocations take place. It would arguably be worthwhile to use a non-garbage-collected language like C++.
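To make the allocation argument concrete, here is a minimal sketch (all class and method names are made up for illustration) contrasting naive per-record allocation with the object-reuse idiom that JVM Big Data frameworks lean on to keep GC pressure down; Hadoop's MapReduce, for instance, reuses its Writable key/value objects across map() calls.

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative key/value holder; the names are invented for this sketch.
    class KeyValue {
        long key;
        String value;

        // Refill in place instead of allocating a fresh object per record.
        void set(long key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    public class AllocationChurn {
        // Naive style: one new object per input record. At billions of
        // records this is exactly the allocation churn described above.
        static List<KeyValue> parseNaive(List<String> lines) {
            List<KeyValue> out = new ArrayList<>();
            long id = 0;
            for (String line : lines) {
                KeyValue kv = new KeyValue();   // fresh allocation per record
                kv.set(id++, line);
                out.add(kv);
            }
            return out;
        }

        // Reuse style: one mutable holder, refilled per record. Hadoop's
        // MapReduce API reuses its key/value objects in the same way,
        // which is why callers must copy anything they want to keep.
        static void processReused(List<String> lines) {
            KeyValue kv = new KeyValue();       // allocated once
            long id = 0;
            for (String line : lines) {
                kv.set(id++, line);             // no per-record allocation
                // ... consume kv here; copy it if it must outlive the loop
            }
        }

        public static void main(String[] args) {
            List<String> lines = List.of("alpha", "beta", "gamma");
            System.out.println(parseNaive(lines).size() + " records parsed naively");
            processReused(lines);
        }
    }

The reuse style trades a little API friendliness (callers must copy values they keep) for far fewer short-lived allocations, which is one way these Java frameworks stay fast despite the garbage collector.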

Another reason I came up with for why Java applications are so popular in this domain is distributed deployment. A key characteristic of Big Data applications is their size: they don't fit on a single machine. The JVM allows really simple deployment (just copy the bytecode around). But is this really an argument? Looking at our own cluster, the hardware is quite similar across machines, and I would assume that this holds true for most companies. So even compiled machine code should be easy to distribute to all machines.

For me personally, the biggest reason is probably DRY (don't repeat yourself). The ecosystem started in Java, and libraries and frameworks grew up around it. They work very well, and nobody is willing to invest in rewriting the whole stack in a different programming language for a marginal gain, if any.

Maybe some of you have deeper insight into this than I do?

Best Answer

Hadoop was originally written in Java because it was used to "fix" problems in Nutch, which was also written in Java. Nutch, in turn, was written in Java because Java offered a write-once-run-anywhere solution.

As for whether C++ or another language would have been a better choice, that's definitely up for debate. On modern architectures, I'd trust Java's or C#'s garbage collector over a random developer's judgement. For most applications, we don't need to be heavily concerned with resource usage beyond normal best practices, unlike the early days of computing, when every bit was important and needed to be managed.

However, Big Data is definitely an outlier to that approach. Even so, I would rather have a developer who understands how Java's garbage collection works write Java than trust a C++ developer to manage memory well by hand.
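For a flavor of what that GC understanding looks like in practice, here is a minimal sketch using the standard java.lang.management API (nothing Hadoop-specific) to inspect how much time the running JVM spends collecting. A GC-aware developer pairs this kind of feedback with standard HotSpot options such as -Xmx, -XX:+UseG1GC, or -Xlog:gc when tuning a heap-heavy worker.

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    // Prints cumulative statistics for each garbage collector in this JVM.
    // Bean names depend on the collector in use (e.g. "G1 Young Generation").
    public class GcStats {
        public static void main(String[] args) {
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                System.out.printf("%s: %d collections, %d ms total%n",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
        }
    }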

That said, this almost always turns into a debate about Java and C# developers being spoiled by their frameworks, and as a C# developer, I'd always rather use a library written and tested by a team of professionals (or one written, tested, and used by the masses) than try to do it myself. Instead of knowing how to manually allocate and manage memory (which I can do in C, but haven't since school), I'd rather just understand how the C# garbage collector works.
