MySQL: Total GROUP BY WITH ROLLUP curiosity

MySQL

I have two queries. One of them makes sense to me, the other don't. First one:

SELECT gender AS 'Gender', count(*) AS '#'
    FROM registrations 
    GROUP BY gender WITH ROLLUP

That gives me this:

Gender       #
Female      20
Male        19
NULL        39

So, I get the count, and the total count. What I expected. Next one:

SELECT c.printable_name AS 'Country', count(*) AS '#' 
    FROM registrations r 
    INNER JOIN country c ON r.country = c.country_id 
    GROUP BY country WITH ROLLUP

Country         #
Denmark         9
Norway         10
Sweden         18
United States   1
Uzbekistan      1
Uzbekistan     39

Same result. But why do I get Uzbekistan for the total??

Best Answer

But why do I get Uzbekistan for the total??

Because you're not SELECTing the item that you're GROUPing BY. If you said:

GROUP BY c.printable_name

You'd get the expected NULL. However you're grouping by a different column so MySQL doesn't know that printable_name is taking part in a rollup-group, and selects any old value from that column, in the join of all registrations. (So it is possible you will see other countries than Uzbekistan.)

This is part of a wider problem with MySQL being permissive on what you can SELECT in a GROUP BY query. For example, you can say:

SELECT gender FROM registrations GROUP BY country;

and MySQL will happily pick one of the gender values for a registration from each country, even though there is no direct causal link (aka “functional dependency”) between country and gender. Other DBMSs will refuse the above command on the grounds that there isn't guaranteed to be one gender per country.(*)

Now, this:

SELECT c.printable_name AS 'Country', count(*) AS '#' 
FROM registrations r 
INNER JOIN country c ON r.country = c.country_id 
GROUP BY country

is OK, because there's a functional dependency between r.country and c.printable_name (assuming you have correctly described your country_id as a PRIMARY KEY).

However MySQL's WITH ROLLUP extension is a bit of a hack in the way it works. On the rollup row stage at the end, it runs over the entire pre-grouping result set to grab its values, and then sets the group-by column to NULL. It doesn't also null other columns that have a functional dependency on that column. It probably should, but MySQL currently doesn't really understand the whole thing about functional dependencies.

So if you select c.printable_name it will show you whichever country name value it randomly picked, and if you select c.country_id it will show you whichever country ID it randomly picked — even though c.country_id is the join criterion, so must be the same as r.country, which is NULL!

What you can do to work around the problem is:

  • group by printable_name instead; should be OK if printable_names are unique, or
  • select “r.country” as well as printable_name, and check that for being NULL, or
  • forget WITH ROLLUP and do a separate query for the end sum. This will be a little slower but it will also be ANSI SQL-92 compliant so your app could work on other databases.

(*: MySQL has an SQL_MODE option ONLY_FULL_GROUP_BY that is supposed to address this issue, but it goes much too far and only lets you select columns from the GROUP BY, not columns that have a functional dependency on the GROUP BY. So it will make valid queries fail as well, making it generally useless.)