Java – Data Aggregation of CSV Files

aggregatebig datadatabasehashingjava

I have k csv files (5 csv files for example), each file has m fields which produce a key and n values. I need to produce a single csv file with aggregated data.

I'm looking for the most efficient solution for this problem, speed mainly. I don't think by the way that we will have memory issues. Also I would like to know if hashing is really a good solution because we will have to use 64 bit hashing solution to reduce the chance for a collision to less than 1% (we are having around 30000000 rows per aggregation).

For example

file 1: f1,f2,f3,v1,v2,v3,v4
        a1,b1,c1,50,60,70,80
        a3,b2,c4,60,60,80,90 

file 2: f1,f2,f3,v1,v2,v3,v4
        a1,b1,c1,30,50,90,40
        a3,b2,c4,30,70,50,90

result: f1,f2,f3,v1,v2,v3,v4  
        a1,b1,c1,80,110,160,120
        a3,b2,c4,90,130,130,180

algorithm that we thought until now:

  1. hashing (using concurentHashTable)

  2. merge sorting the files

  3. DB: using mysql or hadoop or redis.

The solution needs to be able to handle Huge amount of data (each file more than two million rows)

a better example: file 1

country,city,peopleNum
england,london,1000000
england,coventry,500000

file 2:

country,city,peopleNum
england,london,500000
england,coventry,500000
england,manchester,500000

merged file:

country,city,peopleNum
england,london,1500000
england,coventry,1000000
england,manchester,500000

The key is: country,city. This is just an example, my real key is of size 6 and the data columns are of size 8 – total of 14 columns.

We would like that the solution will be the fastest in regard of data processing.

Best Answer

Since the key-fileds are always the first colums i would sort the source rows (whithout the header) by the keys and then advance in both files line by line similar to Merge_algorithm used in Mergesort.

But instead of sorting the 2 csv-lists to one you compute the sum of elements that are in both list. Elements that are in one list only are simply copied.

The algorithm looks similar to this:

While (NOT EndOfFile(left-item) AND NOT EndOfFile(right-item))
  if (EndOfFile(right-item) OR left-item.key < right-item.key) store(left-item); advance left-item;
  if (EndOfFile(left-item) OR left-item.key > right-item.key) store(right-item); advance right-item;
  if (left-item.key = right-item.key) store(sum(right-item, left-item)); advance right-item; advance left-item
Related Topic