Hadoop
is a platform that allows us to store and process large volumes of data across clusters of machines in a parallel manner. It is a batch processing system where we don't have to worry about the internals of data storage or processing.
It not only provides HDFS, the distributed file system for reliable data storage, but also a processing framework, MapReduce, which allows huge data sets to be processed across clusters of machines in parallel.
One of the biggest advantages of Hadoop is data locality. Moving data that huge is costly, so Hadoop moves the computation to the data instead. Both HDFS and MapReduce are highly optimized to work with really large data.
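The MapReduce programming model itself is simple enough to sketch in plain Python. This is only an illustration of the map/shuffle/reduce phases with a word count, not Hadoop's distributed implementation; all names here are made up for the example:

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every input split.
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the values for each key.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data", "big clusters process big data"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)
```

In real Hadoop the map and reduce tasks run in parallel across the cluster, and the scheduler tries to run each map task on a node that already holds the data block it reads, which is exactly the data locality point above.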
HDFS assures high availability and failover through data replication, so that if any one of the machines in your cluster goes down because of some catastrophe, your data is still safe and available.
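The replication idea can be sketched in a few lines. This is a deliberately naive, hypothetical placement function (real HDFS placement is rack-aware), just to show why a default replication factor of 3 survives the loss of a node:

```python
# Hypothetical sketch of HDFS-style block replication. Real HDFS
# placement is rack-aware; this just spreads copies across nodes.
REPLICATION = 3

def place_replicas(block_id, nodes, replication=REPLICATION):
    # Pick `replication` distinct nodes for the block's copies.
    start = hash(block_id) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replication)]

nodes = ["node1", "node2", "node3", "node4"]
placement = place_replicas("blk_0001", nodes)

# If one of those nodes dies, the block is still readable from the others.
failed = placement[0]
survivors = [n for n in placement if n != failed]
assert len(survivors) == REPLICATION - 1
```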
HBase, on the other hand, is a NoSQL database. We can think of it as a distributed, scalable, big data store.
It is used to overcome the pitfalls of HDFS, such as its inability to do random reads and writes.
HBase is a suitable choice if we need random, real-time read/write access to our data. It was modeled after Google's BigTable, while HDFS was modeled after GFS (the Google File System).
It is not necessary to use HBase on top of HDFS only. We can use HBase with other persistent stores like S3 or EBS.
If you want to know about Hadoop and HBase in detail, you can visit their respective home pages, hadoop.apache.org and hbase.apache.org.
You can also go through the following books if you want to learn in depth: "Hadoop: The Definitive Guide" and "HBase: The Definitive Guide".
Hadoop is basically 3 things: a file system (the Hadoop Distributed File System), a computation framework (MapReduce) and a management bridge (Yet Another Resource Negotiator). HDFS allows you to store huge amounts of data in a distributed (provides faster read/write access) and redundant (provides better availability) manner. And MapReduce allows you to process this huge data in a distributed and parallel manner. But MapReduce is not limited to just HDFS. Being a file system, HDFS lacks the random read/write capability. It is good for sequential data access. And this is where HBase comes into the picture. It is a NoSQL database that runs on top of your Hadoop cluster and provides you random, real-time read/write access to your data.
You can store both structured and unstructured data in Hadoop, and in HBase as well. Both of them provide multiple mechanisms to access the data, like the shell and other APIs.
And, HBase stores data as key/value pairs in a columnar fashion, while HDFS stores data as flat files. Some of the salient features of both systems are:
Hadoop
- Optimized for streaming access of large files.
- Follows write-once read-many ideology.
- Doesn't support random read/write.
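The write-once, read-many semantics above can be illustrated with a small, hypothetical sketch (the class and its methods are invented for this example; this is not the HDFS API):

```python
# Hypothetical sketch of HDFS-style file semantics: a file is written
# once (append-only), then closed and read sequentially. There is no
# way to update bytes in the middle of an existing file.
class WriteOnceFile:
    def __init__(self):
        self._blocks = []      # loosely analogous to HDFS blocks
        self._closed = False

    def append(self, data):
        if self._closed:
            raise IOError("file is closed: write-once, read-many")
        self._blocks.append(data)

    def close(self):
        self._closed = True

    def read_all(self):
        # Reads stream through the blocks in order: sequential access.
        return "".join(self._blocks)

f = WriteOnceFile()
f.append("log line 1\n")
f.append("log line 2\n")
f.close()
print(f.read_all())
```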
HBase
- Stores key/value pairs in columnar fashion (columns are clubbed together as column families).
- Provides low latency access to small amounts of data from within a large data set.
- Provides flexible data model.
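The HBase data model in those bullets (sorted row keys, column families clubbing columns together, per-cell random access) can be sketched as a toy in-memory store. All names here are hypothetical; this is a model of the layout, not the HBase client API:

```python
# Hypothetical in-memory sketch of HBase's data model: a sorted map of
# row key -> column family -> column qualifier -> value. Qualifiers can
# differ from row to row, which is what makes the model flexible.
class MiniColumnStore:
    def __init__(self, column_families):
        # Column families are fixed up front, as in an HBase table schema.
        self._families = set(column_families)
        self._rows = {}  # row key -> {family: {qualifier: value}}

    def put(self, row, family, qualifier, value):
        if family not in self._families:
            raise KeyError("unknown column family: " + family)
        self._rows.setdefault(row, {}).setdefault(family, {})[qualifier] = value

    def get(self, row, family, qualifier):
        # Random, low-latency read of a single cell by key.
        return self._rows[row][family][qualifier]

    def scan(self):
        # Rows come back in sorted row-key order, as in HBase.
        for row in sorted(self._rows):
            yield row, self._rows[row]

store = MiniColumnStore(["info", "metrics"])
store.put("user#07", "info", "name", "Alan")
store.put("user#42", "info", "name", "Ada")
store.put("user#42", "metrics", "logins", 7)
print(store.get("user#42", "info", "name"))   # random read: Ada
print([row for row, _ in store.scan()])       # ['user#07', 'user#42']
```

Note the fixed-width row keys (`user#07`, not `user#7`): because scans are ordered lexicographically by row key, key design matters in this model.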
Hadoop is most suited for offline batch-processing kinds of workloads, while HBase is used when you have real-time needs.
An analogous comparison would be between MySQL and Ext4.
Best Answer
I don't think either is better than the other; it's not an either/or choice. These are very different systems, each with its strengths and weaknesses, so it really depends on your use cases. They can definitely be used to complement one another in the same infrastructure.
To explain the difference better I'd like to borrow a picture from Cassandra: The Definitive Guide, where they go over the CAP theorem. What they say is basically that for any distributed system, you have to find a balance between consistency, availability and partition tolerance, and you can only realistically satisfy two of these properties. In that picture, HBase sits on the consistency and partition-tolerance side, while Cassandra trades strict consistency for availability and partition tolerance.
When it comes to Hadoop, HBase is built on top of HDFS, which makes it pretty convenient to use if you already have a Hadoop stack. It is also supported by Cloudera, which is a standard enterprise distribution for Hadoop.
But Cassandra also has more Hadoop integration now, namely DataStax Brisk, which is gaining popularity. You can also natively stream data from the output of a Hadoop job into a Cassandra cluster using a Cassandra-provided output format (BulkOutputFormat, for example); we are no longer at the point where Cassandra was just a standalone project. In my experience, I've found that Cassandra is awesome for random reads, and not so much for scans.
To put a little color to the picture, I've been using both at my job in the same infrastructure, and HBase has a very different purpose than Cassandra. I've used Cassandra mostly for real-time, very fast lookups, while I've used HBase more for heavy ETL batch jobs with less stringent latency requirements.
This is a question that would truly be worthy of a blog post, so instead of going on and on I'd like to point you to an article which sums up a lot of the key differences between the two systems. Bottom line: there is no superior solution IMHO, and you should really think about your use cases to see which system is better suited.