Hadoop – Cassandra and Hadoop – real-time vs batch

cassandra hadoop nosql

As per http://www.dbta.com/Articles/Columns/Notes-on-NoSQL/Cassandra-and-Hadoop—Strange-Bedfellows-or-a-Match-Made-in-Heaven-75890.aspx

Cassandra has pursued somewhat different solutions than has Hadoop. Cassandra excels at high-volume real-time transaction processing, while Hadoop excels at more batch-oriented analytical solutions.

What differences in the architecture/implementation of Cassandra and Hadoop account for this difference in usage? (In lay software-professional terms, please.)

Best Answer

I wanted to add this because I think there is a misleading statement here suggesting that Cassandra performs well for reads. Cassandra is not very good at random reads either. It is good compared to other solutions in how it lets you read randomly over a huge amount of data, but if the reads are truly random, at some point you can't avoid hitting the disk every single time, which is expensive. Throughput may come down to something useless like a few thousand hits/second, depending on your cluster. So planning on doing lots of random queries might not be the best idea; you'll run into a wall if you start thinking like that.

I'd say everything in big data works better when you do sequential reads, or find a way to store the data so it can be read sequentially. In most cases, even when you do real-time processing, you still want to find a way to batch your queries. This is why you need to think beforehand about what you store under a key and try to get the most information possible out of a single read.

It's also kind of funny that the statement puts "transaction" and "Cassandra" in the same sentence, because that really doesn't happen.

Hadoop, on the other hand, is meant to be batch almost by definition. But Hadoop is a distributed MapReduce framework, not a database. In fact, I've seen and used lots of Hadoop over Cassandra; they're not antagonistic technologies. Handling your big data in real time is doable, but it requires careful thought about when and how you hit the database.
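To make the point about "what you store under a key" a bit more concrete, here is a toy sketch in Python. It is not Cassandra's API. `ToyStore`, `load_narrow`, `load_wide`, the sensor data, and the cost model (one simulated disk seek per key lookup) are all made up for illustration. It only shows why grouping related values under one key can turn many random reads into a single sequential one.

```python
from collections import defaultdict


class ToyStore:
    """Hypothetical key-value store: every key lookup costs one simulated seek."""

    def __init__(self):
        self._rows = defaultdict(list)
        self.seeks = 0

    def append(self, key, value):
        # Values under the same key are kept together, in insertion order.
        self._rows[key].append(value)

    def read(self, key):
        # Assumption: one lookup = one seek, no matter how many values it returns.
        self.seeks += 1
        return list(self._rows.get(key, []))


def load_narrow(store, sensor, readings):
    """One value per key: key = (sensor, timestamp)."""
    for ts, value in readings:
        store.append((sensor, ts), value)


def load_wide(store, sensor, readings, bucket_seconds=86400):
    """Many values per key: key = (sensor, day bucket), values in time order."""
    for ts, value in readings:
        store.append((sensor, ts // bucket_seconds), (ts, value))


# One made-up reading per minute for a single day.
readings = [(ts, ts % 7) for ts in range(0, 86400, 60)]

narrow = ToyStore()
load_narrow(narrow, "s1", readings)
wide = ToyStore()
load_wide(wide, "s1", readings)

# Fetch the whole day from each layout.
narrow_values = [narrow.read(("s1", ts))[0] for ts, _ in readings]
wide_values = [value for _, value in wide.read(("s1", 0))]

print(narrow.seeks, wide.seeks)  # prints: 1440 1
```

Both layouts return the same values, but in this toy model the narrow layout pays one seek per reading while the wide layout pays one seek for the whole day. Real clusters have caches, compaction, and row-size limits that this ignores, so treat the numbers as a picture of the idea, not a benchmark.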

Edit: Removed secondary indices example, as last time I checked that used random reads (though I've been away from Cassandra for more than a year now).
