Mysql – Calculating the Median with Mysql

medianMySQLstatistics

I'm having trouble with calculating the median of a list of values, not the average.

I found this article
Simple way to calculate median with MySQL

It has a reference to the following query which I don't understand properly.

SELECT x.val from data x, data y
GROUP BY x.val
HAVING SUM(SIGN(1-SIGN(y.val-x.val))) = (COUNT(*)+1)/2

If I have a time column and I want to calculate the median value, what do the x and y columns refer to?

Best Answer

I propose a faster way.

Get the row count:

SELECT CEIL(COUNT(*)/2) FROM data;

Then take the middle value in a sorted subquery:

SELECT max(val) FROM (SELECT val FROM data ORDER BY val limit @middlevalue) x;

I tested this with a 5x10e6 dataset of random numbers and it will find the median in under 10 seconds.

This will find an arbitrary percentile by replacing the COUNT(*)/2 with COUNT(*)*n where n is the percentile (.5 for median, .75 for 75th percentile, etc).

Related Solutions

Mysql – Should I use the datetime or timestamp data type in MySQL

Timestamps in MySQL are generally used to track changes to records, and are often updated every time the record is changed. If you want to store a specific value you should use a datetime field.

If you meant that you want to decide between using a UNIX timestamp or a native MySQL datetime field, go with the native format. You can do calculations within MySQL that way ("SELECT DATE_ADD(my_datetime, INTERVAL 1 DAY)") and it is simple to change the format of the value to a UNIX timestamp ("SELECT UNIX_TIMESTAMP(my_datetime)") when you query the record if you want to operate on it with PHP.

“On-line” (iterator) algorithms for estimating statistical median, mode, skewness, kurtosis

I use these incremental/recursive mean and median estimators, which both use constant storage:

mean += eta * (sample - mean)
median += eta * sgn(sample - median)

where eta is a small learning rate parameter (e.g. 0.001), and sgn() is the signum function which returns one of {-1, 0, 1}. (Use a constant eta if the data is non-stationary and you want to track changes over time; otherwise, for stationary sources you can use something like eta=1/n for the mean estimator, where n is the number of samples seen so far... unfortunately, this does not appear to work for the median estimator.)

This type of incremental mean estimator seems to be used all over the place, e.g. in unsupervised neural network learning rules, but the median version seems much less common, despite its benefits (robustness to outliers). It seems that the median version could be used as a replacement for the mean estimator in many applications.

I would love to see an incremental mode estimator of a similar form...

UPDATE

I just modified the incremental median estimator to estimate arbitrary quantiles. In general, a quantile function (http://en.wikipedia.org/wiki/Quantile_function) tells you the value that divides the data into two fractions: p and 1-p. The following estimates this value incrementally:

quantile += eta * (sgn(sample - quantile) + 2.0 * p - 1.0)

The value p should be within [0,1]. This essentially shifts the sgn() function's symmetrical output {-1,0,1} to lean toward one side, partitioning the data samples into two unequally-sized bins (fractions p and 1-p of the data are less than/greater than the quantile estimate, respectively). Note that for p=0.5, this reduces to the median estimator.

Best Answer

Related Solutions

Mysql – Should I use the datetime or timestamp data type in MySQL

“On-line” (iterator) algorithms for estimating statistical median, mode, skewness, kurtosis

Related Topic