Hi, I am new to HBase and Hadoop. I couldn't find out why we use Hadoop with HBase. I know Hadoop provides a file system, but I read that we can use HBase without Hadoop, so why do we use Hadoop?
Thanks
Hadoop and HBase
Related Solutions
MapReduce is just a computing framework; HBase has nothing to do with it. That said, you can efficiently put or fetch data to/from HBase by writing MapReduce jobs. Alternatively, you can write sequential programs using other HBase APIs, such as the Java client, to put or fetch the data. But we use Hadoop, HBase, etc. to deal with gigantic amounts of data, so that doesn't make much sense: a normal sequential program would be highly inefficient when your data is that huge.
Coming back to the first part of your question, Hadoop is basically 2 things: a distributed file system (HDFS) + a computation or processing framework (MapReduce). Like any other FS, HDFS provides us storage, but in a fault-tolerant manner with high throughput and a lower risk of data loss (because of replication). But, being a FS, HDFS lacks random read and write access. This is where HBase comes into the picture. It's a distributed, scalable big data store, modelled after Google's BigTable. It stores data as key/value pairs.
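To make the random-access point concrete, here is a minimal HBase shell session (this assumes a running HBase instance; the table, row key, and column names are just illustrative):

```
create 'users', 'info'                      # table with one column family 'info'
put 'users', 'row1', 'info:name', 'Alice'   # random write of a single cell
get 'users', 'row1'                         # random read of a single row by key
put 'users', 'row1', 'info:name', 'Bob'     # in-place update of one cell -- not possible on plain HDFS files
```

Each `put`/`get` addresses a single key/value cell directly, which is exactly what HDFS by itself cannot do.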
Coming to Hive: it provides data warehousing facilities on top of an existing Hadoop cluster. Along with that it provides an SQL-like interface, which makes your work easier in case you are coming from an SQL background. You can create tables in Hive and store data there. You can even map your existing HBase tables to Hive and operate on them.
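As a sketch of both points (assuming a working Hive installation with the HBase handler on the classpath; table and column names are made up), HiveQL looks like plain SQL, and an existing HBase table can be mapped with the HBase storage handler:

```
-- a native Hive table, queried with familiar SQL
CREATE TABLE page_views (user_id STRING, url STRING, ts BIGINT);
SELECT url, COUNT(*) FROM page_views GROUP BY url;

-- mapping an existing HBase table 'users' into Hive
CREATE EXTERNAL TABLE hbase_users (key STRING, name STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,info:name')
TBLPROPERTIES ('hbase.table.name' = 'users');
```

After the mapping, `SELECT * FROM hbase_users` reads live data straight out of HBase.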
Pig, meanwhile, is basically a dataflow language that allows us to process enormous amounts of data very easily and quickly. Pig has 2 parts: the Pig interpreter and the language, Pig Latin. You write Pig scripts in Pig Latin and process them using the Pig interpreter. Pig makes our life a lot easier; writing MapReduce directly is not always easy, and in some cases it can really become a pain.
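A small Pig Latin script gives a feel for the dataflow style (the input path and field names are illustrative):

```
-- word count in a few lines of Pig Latin
lines   = LOAD '/data/input.txt' AS (line:chararray);
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS cnt;
STORE counts INTO '/data/wordcount';
```

You would run this with `pig script.pig`, and the interpreter compiles the dataflow into MapReduce jobs for you.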
I wrote an article some time ago with a short comparison of different tools in the Hadoop ecosystem. It's not an in-depth comparison, but a short intro to each of these tools that can help you get started. (Just to add to my answer; no self-promotion intended.)
Both Hive and Pig queries get converted into MapReduce jobs under the hood.
HTH
Hadoop is basically 3 things: a FS (the Hadoop Distributed File System), a computation framework (MapReduce), and a management bridge (YARN, Yet Another Resource Negotiator). HDFS allows you to store huge amounts of data in a distributed (faster read/write access) and redundant (better availability) manner, and MapReduce allows you to process this huge data in a distributed and parallel manner. But MapReduce is not limited to just HDFS. Being a FS, HDFS lacks random read/write capability; it is good for sequential data access. This is where HBase comes into the picture. It is a NoSQL database that runs on top of your Hadoop cluster and provides you random, real-time read/write access to your data.
You can store both structured and unstructured data in Hadoop, and in HBase as well. Both of them provide multiple mechanisms to access the data, like the shell and other APIs. HBase stores data as key/value pairs in a columnar fashion, while HDFS stores data as flat files. Some of the salient features of the two systems are:
Hadoop
- Optimized for streaming access to large files.
- Follows a write-once, read-many ideology.
- Doesn't support random reads/writes.
HBase
- Stores key/value pairs in a columnar fashion (columns are grouped together as column families).
- Provides low-latency access to small amounts of data from within a large data set.
- Provides a flexible data model.
Hadoop is best suited for offline batch processing, while HBase is used when you have real-time needs.
An analogous comparison would be between MySQL and Ext4.
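The write-once/random-write contrast above can be seen directly from the two shells (paths and names are illustrative; both commands assume a running cluster):

```
# HDFS: whole-file, sequential operations only
hdfs dfs -put access.log /logs/access.log     # write the file once
hdfs dfs -cat /logs/access.log                # stream it back sequentially

# HBase: cell-level, random operations
echo "put 'users', 'row1', 'info:age', '30'" | hbase shell   # update a single cell in place
```

To change one record in the HDFS file you would have to rewrite the whole file; in HBase you just overwrite the one cell.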
Best Answer
Hadoop
is a platform that allows us to store and process large volumes of data across clusters of machines in a parallel manner. It is a batch processing system where we don't have to worry about the internals of data storage or processing. It provides not only HDFS, the distributed file system for reliable data storage, but also a processing framework, MapReduce, that allows processing of huge data sets across clusters of machines in parallel.
One of the biggest advantages of Hadoop is that it provides data locality. Moving data that is so huge is costly, so Hadoop moves the computation to the data instead. Both HDFS and MapReduce are highly optimized to work with really large data.
HDFS assures high availability and failover through data replication, so that if any one of the machines in your cluster goes down because of some catastrophe, your data is still safe and available.
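The replication factor behind this safety is just a configuration knob; for example, in `hdfs-site.xml` (3 is the usual default):

```
<property>
  <name>dfs.replication</name>
  <value>3</value>  <!-- each block is stored on 3 different DataNodes -->
</property>
```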
On the other hand, HBase is a NoSQL database. We can think of it as a distributed, scalable big data store. It is used to overcome a pitfall of HDFS: its inability to support random reads and writes. HBase is a suitable choice if we need random, real-time read/write access to our data. It was modeled after Google's BigTable, while HDFS was modeled after GFS (the Google File System).
It is not necessary to use HBase on top of HDFS only; we can use HBase with other persistent stores like S3 or EBS. If you want to know about Hadoop and HBase in detail, you can visit the respective home pages: hadoop.apache.org and hbase.apache.org.
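For example, pointing HBase at S3 instead of HDFS comes down to changing `hbase.rootdir` in `hbase-site.xml` (the bucket name here is illustrative, and the s3a connector must be available on the classpath):

```
<property>
  <name>hbase.rootdir</name>
  <value>s3a://my-hbase-bucket/hbase</value>  <!-- illustrative bucket; hdfs://... is the usual setting -->
</property>
```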
You can also go through the following books if you want to learn in depth: "Hadoop: The Definitive Guide" and "HBase: The Definitive Guide".