I have a solution that can be parallelized, but I don't (yet) have experience with hadoop/nosql, and I'm not sure which solution is best for my needs. In theory, if I had unlimited CPUs, my results should come back instantaneously. So, any help would be appreciated. Thanks!
Here's what I have:
- 1000s of datasets
- dataset keys:
- all datasets have the same keys
- 1 million keys (this may later be 10 or 20 million)
- dataset columns:
- each dataset has the same columns
- 10 to 20 columns
- most columns hold numerical values that we need to aggregate (avg, stddev, plus statistics calculated with R)
- a few columns are "type_id" columns, since a particular query may only include certain type_ids
- web application
- user can choose which datasets they are interested in (anywhere from 15 to 1000)
- application needs to present each key along with the aggregated results (avg, stddev) of each column
- updates of data:
- an entire dataset can be added, dropped, or replaced/updated
- would be cool to be able to add columns. But, if required, can just replace the entire dataset.
- never add rows/keys to a dataset, so we don't need a system with lots of fast writes
- infrastructure:
- currently two machines with 24 cores each
- eventually, want ability to also run this on amazon
I can't precompute my aggregated values, but since each key is independent, this should be easily scalable. Currently, I have this data in a postgres database, where each dataset is in its own partition.
- partitions are nice, since can easily add/drop/replace partitions
- database is nice for filtering based on type_id
- it's not easy to write parallel queries against a database
- databases are good for structured data, and my data is not structured
As a proof of concept I tried out hadoop:
- created a tab separated file per dataset for a particular type_id
- uploaded to hdfs
- map: retrieved a value/column for each key
- reduce: computed average and standard deviation
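The map and reduce steps above can be sketched in plain Python (a simulation of the job, not actual Hadoop code; the key in field 0, the value column index, and the tab-separated format are assumptions):

```python
import math
from collections import defaultdict

def mapper(line, col=2):
    """Emit (key, value) from one tab-separated row; column index is an assumption."""
    fields = line.rstrip("\n").split("\t")
    return fields[0], float(fields[col])

def reducer(key, values):
    """Compute the average and (population) standard deviation for one key."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return key, mean, math.sqrt(var)

def run(lines):
    """Group mapper output by key, then reduce each group."""
    grouped = defaultdict(list)
    for line in lines:
        k, v = mapper(line)
        grouped[k].append(v)
    return [reducer(k, vs) for k, vs in grouped.items()]
```

Since each key reduces independently, the groups can be handled by separate workers, which is what makes the per-key aggregation scale.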
From my crude proof of concept, I can see this will scale nicely, but I can also see that hadoop/hdfs has latency, and I've read that it's generally not used for real-time querying (even though I'm ok with returning results to users within 5 seconds).
Any suggestion on how I should approach this? I was thinking of trying HBase next to get a feel for that. Should I instead look at Hive? Cassandra? Voldemort?
thanks!
Best Answer
Hive and Pig don't seem like they would help you: each of them essentially compiles down to one or more map/reduce jobs, so the response cannot come back within 5 seconds.
HBase may work, although your infrastructure is a bit small for optimal performance. I don't understand why you can't pre-compute summary statistics for each column. You should look up computing running averages so that you don't have to do heavy-weight reduces.
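The running-statistics idea amounts to accumulating only a count, a sum, and a sum of squares per column, so values never need to be stored or re-scanned. A minimal sketch (the class name is mine; for very large or ill-conditioned data, Welford's algorithm is the numerically safer variant):

```python
import math

class RunningStats:
    """Accumulate count, sum, and sum of squares in a single pass."""

    def __init__(self):
        self.n = 0
        self.total = 0.0
        self.total_sq = 0.0

    def add(self, x):
        self.n += 1
        self.total += x
        self.total_sq += x * x

    def mean(self):
        return self.total / self.n

    def stddev(self):
        # population stddev: sqrt(E[X^2] - (E[X])^2)
        m = self.mean()
        return math.sqrt(max(self.total_sq / self.n - m * m, 0.0))

    def merge(self, other):
        """Combine accumulators from two partitions -- this is what makes it parallel."""
        self.n += other.n
        self.total += other.total
        self.total_sq += other.total_sq
        return self
```

Each dataset (or partition) keeps its own accumulator, and `merge` combines them in O(1), so the per-query reduce becomes trivial.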
check out http://en.wikipedia.org/wiki/Standard_deviation
stddev(X) = sqrt(E[X^2]- (E[X])^2)
this implies that you can get the stddev of the concatenation AB of two datasets A and B from their individual sums alone:
sqrt(E[AB^2] - (E[AB])^2), where E[AB^2] = (sum(A^2) + sum(B^2)) / (|A| + |B|) and E[AB] = (sum(A) + sum(B)) / (|A| + |B|)
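A quick sanity check of that identity in Python (`combined_stddev` is a name I made up; it computes the stddev of the concatenation of A and B using only per-dataset sums):

```python
import math

def combined_stddev(a, b):
    """Population stddev of A concatenated with B, from per-dataset sums only."""
    n = len(a) + len(b)                                   # |A| + |B|
    s = sum(a) + sum(b)                                   # sum(AB)
    sq = sum(x * x for x in a) + sum(x * x for x in b)    # sum(AB^2)
    mean = s / n                                          # E[AB]
    return math.sqrt(sq / n - mean ** 2)                  # sqrt(E[AB^2] - (E[AB])^2)
```

For example, with A = [1, 2, 3] and B = [4, 5], the combined mean is 3 and the combined population stddev is sqrt(2), matching a direct computation over [1, 2, 3, 4, 5].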