Perl statistics package that doesn't make me load the entire dataset at once

memory perl statistics

I'm looking for a statistics package for Perl (CPAN is fine) that allows me to add data incrementally instead of having to pass in an entire array of data.

Only the mean, median, stddev, max, and min are necessary, nothing too complicated.

The reason is that my dataset is far too large to fit into memory. The data source is a MySQL database, so right now I'm querying manageable subsets of the data, computing the statistics for each subset, then combining the results later.
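Combining the subsets is exact for everything except the median, as long as each subset summary keeps count, sum, sum of squares, min, and max. A minimal sketch under that assumption (the helper name and hash layout are made up for illustration; the sum-of-squares variance formula is mathematically exact but can lose precision for large values):

```perl
use strict;
use warnings;
use List::Util qw(min max sum);

# Merge per-chunk summaries into overall statistics.
# Each chunk is a hashref: { n, sum, sumsq, min, max }.
sub merge_chunks {
    my @chunks = @_;
    my $n     = sum map { $_->{n} }     @chunks;
    my $sum   = sum map { $_->{sum} }   @chunks;
    my $sumsq = sum map { $_->{sumsq} } @chunks;
    my $mean  = $sum / $n;
    return {
        n      => $n,
        mean   => $mean,
        stddev => sqrt($sumsq / $n - $mean ** 2),  # population stddev
        min    => min(map { $_->{min} } @chunks),
        max    => max(map { $_->{max} } @chunks),
    };
}
```

The median doesn't merge this way: the median of subset medians is not, in general, the median of the whole dataset.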

If you have other ideas on how to overcome this issue, I'd be much obliged!

Best Answer

You cannot compute an exact median unless you either keep the whole dataset in memory or run through the data more than once.

UPDATE An exact stddev, on the other hand, can be done in one pass: Welford's online algorithm maintains a running mean and sum of squared deviations in O(1) memory.
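A one-pass stddev can in fact be exact: Welford's online algorithm keeps only a count, a running mean, and a running sum of squared deviations. A minimal sketch (the function names are mine):

```perl
use strict;
use warnings;

# Welford's online algorithm: exact running mean and variance in a
# single pass with O(1) memory.
my ($n, $mean, $m2) = (0, 0, 0);

sub add_value {
    my ($x) = @_;
    $n++;
    my $delta = $x - $mean;
    $mean += $delta / $n;             # running mean
    $m2   += $delta * ($x - $mean);   # running sum of squared deviations
}

sub stddev { $n ? sqrt($m2 / $n) : 0 }   # population stddev
```

This is also more numerically stable than the textbook sum-of-squares shortcut.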

The rest are completely trivial to do in 3-5 lines of Perl, no module needed. Stddev can be done in 2 passes fairly trivially as well; an exact median is harder without keeping the data around. (I recently wrote a script that does exactly what you describe, but for IP reasons I'm pretty sure I'm not allowed to post it as an example, sorry.)

Sample code:

my ($min, $max);
my $sum   = 0;
my $count = 0;
while (<>) {
    chomp;
    my $current_value = $_;  # assume input is one value per line for simplicity's sake
    $sum += $current_value;
    $count++;
    $min = $current_value if !defined $min || $min > $current_value;
    $max = $current_value if !defined $max || $max < $current_value;
}
my $mean = $sum / $count;

my $sum_mean_diffs = 0;
# Second pass to compute stddev (reuse it for the median too). Note that
# <> is exhausted after the first loop, so you must reopen the file (or
# re-issue the query) before this pass.
while (<>) {
    chomp;
    my $current_value = $_;
    $sum_mean_diffs += ($current_value - $mean) ** 2;
}
my $std_dev = sqrt($sum_mean_diffs / $count);
# The median is left as an exercise for the reader.
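If re-reading the input is cheap (e.g. re-running the query), one way to get an exact median in O(1) memory is to binary-search the value range: each pass counts how many values fall at or below the midpoint, taking O(log(range/precision)) passes in total. A sketch under those assumptions; the `$read_all` callback stands in for however you re-read your data:

```perl
use strict;
use warnings;

# Binary-search the value domain for the (lower) median.
# $read_all: callback returning an arrayref of all values (one "pass");
# $lo/$hi: bounds on the values (e.g. min/max from a first pass);
# $eps: desired precision.
sub streaming_median {
    my ($read_all, $lo, $hi, $eps) = @_;
    my $n = scalar @{ $read_all->() };
    while ($hi - $lo > $eps) {
        my $mid   = ($lo + $hi) / 2;
        my $below = grep { $_ <= $mid } @{ $read_all->() };  # one pass
        if ($below * 2 >= $n + 1) { $hi = $mid }  # median <= $mid
        else                      { $lo = $mid }  # median >  $mid
    }
    return ($lo + $hi) / 2;
}
```

For even counts this converges on the lower median; averaging the two middle values would take a second search.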