Python – What language could I use for fast execution of this database summarization task

apache-piglispocamlpythonsql

So I wrote a Python program to handle a little data processing
task.

Here's a very brief specification in a made-up language of the computation I want:

parse "%s %lf %s" aa bb cc | group_by aa | quickselect --key=bb 0:5 | \
    flatten | format "%s %lf %s" aa bb cc

That is, for each line, parse out a word, a floating-point number, and another word. Think of them as a player ID, a score, and a date. I want the top five scores and dates for each player. The data size is not trivial, but not huge; about 630 megabytes.

I want to know what real, executable language I should have written it in to
get it to be similarly short (as the Python below) but much faster.

#!/usr/bin/python
# -*- coding: utf-8; -*-
import sys

top_5 = {}

for line in sys.stdin:
    aa, bb, cc = line.split()

    # We want the top 5 for each distinct value of aa.  There are
    # hundreds of thousands of values of aa.
    bb = float(bb)
    if aa not in top_5: top_5[aa] = []
    current = top_5[aa]
    current.append((bb, cc))

    # Every once in a while, we drop the values that are not in
    # the top 5, to keep our memory footprint down, because some
    # values of aa have thousands of (bb, cc) pairs.
    if len(current) > 10:
        current.sort()
        current[:-5] = []

for aa in top_5:
    current = top_5[aa]
    current.sort()
    for bb, cc in current[-5:]:
        print aa, bb, cc

Here’s some sample input data:

3 1.5 a
3 1.6 b
3 0.8 c
3 0.9 d
4 1.2 q
3 1.5 e
3 1.8 f
3 1.9 g

Here’s the output I get from it:

3 1.5 a
3 1.5 e
3 1.6 b
3 1.8 f
3 1.9 g
4 1.2 q

There are seven values for 3, and so we drop the c and d values
because their bb value puts them out of the top 5. Because 4 has
only one value, its “top 5” consists of just that one value.

This runs faster than doing the same queries in MySQL (at least, the
way we’ve found to do the queries) but I’m pretty sure it's spending
most of its time in the Python bytecode interpreter. I think that in
another language, I could probably get it to process hundreds of
thousands of rows per second instead of per minute. So I’d like to
write it in a language that has a faster implementation.

But I’m not sure what language to choose.

I haven’t been able to figure out how to express this as a single query in SQL, and
actually I’m really unimpressed with MySQL’s ability even to merely
select * from foo into outfile 'bar'; the input data.

C is an obvious choice, but things like line.split(), sorting a list
of 2-tuples, and making a hash table require writing some code that’s
not in the standard library, so I would end up with 100 lines of code
or more instead of 14.

C++ seems like it might be a better choice (it has strings, maps,
pairs, and vectors in the standard library) but it seems like the code
would be a lot messier with STL.

OCaml would be fine, but does it have an equivalent of line.split(),
and will I be sad about the performance of its map?

Common Lisp might work?

Is there some equivalent of Matlab for database computation like this
that lets me push the loops down into fast code? Has anybody tried Pig?

(Edit: responded to davethegr8's comment by providing some sample input and output data, and fixed a bug in the Python program!)

(Additional edit: Wow, this comment thread is really excellent so far. Thanks, everybody!)

Edit:

There was an eerily similar question asked on sbcl-devel in 2007 (thanks, Rainer!), and here's an awk script from Will Hartung for producing some test data (although it doesn't have the Zipfian distribution of the real data):

BEGIN {
 for (i = 0; i < 27000000; i++) {
  v = rand();
  k = int(rand() * 100);
  print k " " v " " i;
 }
 exit;
}

Best Answer

I have a hard time believing that any script without any prior knowledge of the data (unlike MySql which has such info pre-loaded), would be faster than a SQL approach.

Aside from the time spent parsing the input, the script needs to "keep" sorting the order by array etc...

The following is a first guess at what should work decently fast in SQL, assuming a index (*) on the table's aa, bb, cc columns, in that order. (A possible alternative would be an "aa, bb DESC, cc" index

(*) This index could be clustered or not, not affecting the following query. Choice of clustering or not, and of needing an "aa,bb,cc" separate index depends on use case, on the size of the rows in table etc. etc.

SELECT T1.aa, T1.bb, T1.cc , COUNT(*)
FROM tblAbc T1
LEFT OUTER JOIN tblAbc T2 ON T1.aa = T2.aa AND 
         (T1.bb < T2.bb OR(T1.bb = T2.bb AND T1.cc < T2.cc))
GROUP BY T1.aa, T1.bb, T1.cc
HAVING COUNT(*) < 5  -- trick, remember COUNT(*) goes 1,1,2,3,...
ORDER BY T1.aa, T1.bb, T1.cc, COUNT(*) DESC

The idea is to get a count of how many records, within a given aa value are smaller than self. There is a small trick however: we need to use LEFT OUTER join, lest we discard the record with the biggest bb value or the last one (which may happen to be one of the top 5). As a result of left joining it, the COUNT(*) value counts 1, 1, 2, 3, 4 etc. and the HAVING test therefore is "<5" to effectively pick the top 5.

To emulate the sample output of the OP, the ORDER BY uses DESC on the COUNT(), which could be removed to get a more traditional top 5 type of listing. Also, the COUNT() in the select list can be removed if so desired, this doesn't impact the logic of the query and the ability to properly sort.

Also note that this query is deterministic in terms of the dealing with ties, i,e, when a given set of records have a same value for bb (within an aa group); I think the Python program may provide slightly different outputs when the order of the input data is changed, that is because of its occasional truncating of the sorting dictionary.

Real solution: A SQL-based procedural approach

The self-join approach described above demonstrates how declarative statements can be used to express the OP's requirement. However this approach is naive in a sense that its performance is roughly bound to the sum of the squares of record counts within each aa 'category'. (not O(n^2) but roughly O((n/a)^2) where a is the number of different values for the aa column) In other words it performs well with data such that on average the number of records associated with a given aa value doesn't exceed a few dozens. If the data is such that the aa column is not selective, the following approach is much -much!- better suited. It leverages SQL's efficient sorting framework, while implementing a simple algorithm that would be hard to express in declarative fashion. This approach could further be improved for datasets with particularly huge number of records each/most aa 'categories' by introducing a simple binary search of the next aa value, by looking ahead (and sometimes back...) in the cursor. For cases where the number of aa 'categories' relative to the overall row count in tblAbc is low, see yet another approach, after this next one.

DECLARE @aa AS VARCHAR(10), @bb AS INT, @cc AS VARCHAR(10)
DECLARE @curAa AS VARCHAR(10)
DECLARE @Ctr AS INT

DROP TABLE  tblResults;
CREATE TABLE tblResults
(  aa VARCHAR(10),
   bb INT,
   cc VARCHAR(10)
);

DECLARE abcCursor CURSOR 
  FOR SELECT aa, bb, cc
  FROM tblABC
  ORDER BY aa, bb DESC, cc
  FOR READ ONLY;

OPEN abcCursor;

SET @curAa = ''

FETCH NEXT FROM abcCursor INTO @aa, @bb, @cc;
WHILE @@FETCH_STATUS = 0
BEGIN
    IF @curAa <> @aa
    BEGIN
       SET @Ctr = 0
       SET @curAa = @aa
    END
    IF @Ctr < 5
    BEGIN
       SET @Ctr = @Ctr + 1;
       INSERT tblResults VALUES(@aa, @bb, @cc);
    END
    FETCH NEXT FROM AbcCursor INTO @aa, @bb, @cc;
END;

CLOSE abcCursor;
DEALLOCATE abcCursor;

SELECT * from tblResults
ORDER BY aa, bb, cc    -- OR .. bb DESC ... for a more traditional order.

Alternative to the above for cases when aa is very unselective. In other words, when we have relatively few aa 'categories'. The idea is to go through the list of distinct categories and to run a "LIMIT" (MySql) "TOP" (MSSQL) query for each of these values. For reference purposes, the following ran in 63 seconds for tblAbc of 61 Million records divided in 45 aa values, on MSSQL 8.0, on a relatively old/weak host.

DECLARE @aa AS VARCHAR(10)
DECLARE @aaCount INT

DROP TABLE  tblResults;
CREATE TABLE tblResults
(  aa VARCHAR(10),
   bb INT,
   cc VARCHAR(10)
);

DECLARE aaCountCursor CURSOR 
  FOR SELECT aa, COUNT(*)
  FROM tblABC
  GROUP BY aa
  ORDER BY aa
  FOR READ ONLY;
OPEN aaCountCursor;


FETCH NEXT FROM aaCountCursor INTO @aa, @aaCount
WHILE @@FETCH_STATUS = 0
BEGIN
    INSERT tblResults 
       SELECT TOP 5 aa, bb, cc
       FROM tblproh
       WHERE aa = @aa
       ORDER BY aa, bb DESC, cc

    FETCH NEXT FROM aaCountCursor INTO @aa, @aaCount;
END;

CLOSE aaCountCursor
DEALLOCATE aaCountCursor

SELECT * from tblResults
ORDER BY aa, bb, cc    -- OR .. bb DESC ... for a more traditional order.

On the question of needing an index or not. (cf OP's remark) When merely running a "SELECT * FROM myTable", a table scan is effectively the fastest appraoch, no need to bother with indexes. However, the main reason why SQL is typically better suited for this kind of things (aside from being the repository where the data has been accumulating in the first place, whereas any external solution needs to account for the time to export the relevant data), is that it can rely on indexes to avoid scanning. Many general purpose languages are far better suited to handle raw processing, but they are fighting an unfair battle with SQL because they need to rebuilt any prior knowledge of the data which SQL has gathered in the course of its data collection / import phase. Since sorting is a typically a time and sometimes space consuming task, SQL and its relatively slower processing power often ends up ahead of alternative solutions.

Also, even without pre-built indexes, modern query optimizers may decide on a plan that involves the creation of a temporary index. And, because sorting is an intrinsic part of DDMS, the SQL servers are generally efficient in that area.

So... Is SQL better?

This said, if we are trying to compare SQL and other languages for pure ETL jobs, i.e. for dealing with heaps (unindexed tables) as its input to perform various transformations and filtering, it is likely that multi-thread-able utilities written in say C, and leveraging efficient sorting libaries, would likely be faster. The determining question to decide on a SQL vs. Non-SQL approach is where the data is located and where should it eventually reside. If we merely to convert a file to be supplied down "the chain" external programs are better suited. If we have or need the data in a SQL server, there are only rare cases that make it worthwhile exporting and processing externally.