How best to merge/sort/page through tons of JSON arrays

datahadoopjsonnosqlperformance

Here's the scenario: Say you have millions of JSON documents stored as text files. Each JSON document is an array of "activity" objects, each of which contain a "created_datetime" attribute. What is the best way to merge/sort/filter/page through these activities via a web UI? For example, say we want to take a few thousand of the documents, merge them into a gigantic array, sort the array by the "created_datetime" attribute descending and then page through it 10 activities at a time.

Also keep in mind that roughly 25% of these JSON documents are updated every day, and updates have to make it into the view within 5 minutes.

My first thought is to parse all of the documents into an RDBMS table and then it would just be a simple query such as "select top 10 name, created_datetime from Activity where user_id=12345 order by created_datetime desc".

Some have suggested I use NoSQL techniques such as hadoop or map/reduce instead. How exactly would this work?

For more background, see: Why is NoSQL better for this scenario?

Best Answer

How to merge/sort/page through a huge amount of data?

Well, for sorting, look at Quicksort if the data is more or less randomized, or Timsort if it's highly ordered. (Quicksort degenerates easily into horrible performance on highly ordered data.)

For merging, there's a pretty simple algorithm for this: list comparison.

  • Take two lists, list A and list B. Sort both lists using the same criteria.
  • Declare two iterators that refer to a single list element, one for each list. Initialize them to reference the first element of their respective lists.
  • Repeat
    • Compare referenced element A (eA) to referenced element B (eB)
    • If eA < eB then handle eA appropriately and increment iterator-A
    • else if eB < eA then eB appropriately and increment iterator-B
    • else handle equality case appropriately and increment both iterators
  • until one iterator reaches the end of its list.
  • (optional) Handle the remaining elements in the other list, if appropriate

This basic concept can be used for a number of operations involving two lists, including merging, by specifying the "handle appropriately" cases. In this case, the appropriate way to handle it is to add an element to the output list.

For paging, let your database engine handle this one.

Related Topic