Sql – must appear in the GROUP BY clause or be used in an aggregate function

aggregate-functionsgroup-bypostgresql-9.1sql

I have a table that looks like this caller 'makerar'

 cname  | wmname |          avg           
--------+-------------+------------------------
 canada | zoro   |     2.0000000000000000
 spain  | luffy  | 1.00000000000000000000
 spain  | usopp  |     5.0000000000000000

And I want to select the maximum avg for each cname.

SELECT cname, wmname, MAX(avg)  FROM makerar GROUP BY cname;

but I will get an error,

ERROR:  column "makerar.wmname" must appear in the GROUP BY clause or be used in an   aggregate function 
LINE 1: SELECT cname, wmname, MAX(avg)  FROM makerar GROUP BY cname;

so i do this

SELECT cname, wmname, MAX(avg)  FROM makerar GROUP BY cname, wmname;

however this will not give the intented results, and the incorrect output below is shown.

 cname  | wmname |          max           
--------+--------+------------------------
 canada | zoro   |     2.0000000000000000
 spain  | luffy  | 1.00000000000000000000
 spain  | usopp  |     5.0000000000000000

Actual Results should be

 cname  | wmname |          max           
--------+--------+------------------------
 canada | zoro   |     2.0000000000000000
 spain  | usopp  |     5.0000000000000000

How can I go about fixing this issue?

Note: This table is a VIEW created from a previous operation.

Best Answer

Yes, this is a common aggregation problem. Before SQL3 (1999), the selected fields must appear in the GROUP BY clause[*].

To workaround this issue, you must calculate the aggregate in a sub-query and then join it with itself to get the additional columns you'd need to show:

SELECT m.cname, m.wmname, t.mx
FROM (
    SELECT cname, MAX(avg) AS mx
    FROM makerar
    GROUP BY cname
    ) t JOIN makerar m ON m.cname = t.cname AND t.mx = m.avg
;

 cname  | wmname |          mx           
--------+--------+------------------------
 canada | zoro   |     2.0000000000000000
 spain  | usopp  |     5.0000000000000000

But you may also use window functions, which looks simpler:

SELECT cname, wmname, MAX(avg) OVER (PARTITION BY cname) AS mx
FROM makerar
;

The only thing with this method is that it will show all records (window functions do not group). But it will show the correct (i.e. maxed at cname level) MAX for the country in each row, so it's up to you:

 cname  | wmname |          mx           
--------+--------+------------------------
 canada | zoro   |     2.0000000000000000
 spain  | luffy  |     5.0000000000000000
 spain  | usopp  |     5.0000000000000000

The solution, arguably less elegant, to show the only (cname, wmname) tuples matching the max value, is:

SELECT DISTINCT /* distinct here matters, because maybe there are various tuples for the same max value */
    m.cname, m.wmname, t.avg AS mx
FROM (
    SELECT cname, wmname, avg, ROW_NUMBER() OVER (PARTITION BY avg DESC) AS rn 
    FROM makerar
) t JOIN makerar m ON m.cname = t.cname AND m.wmname = t.wmname AND t.rn = 1
;


 cname  | wmname |          mx           
--------+--------+------------------------
 canada | zoro   |     2.0000000000000000
 spain  | usopp  |     5.0000000000000000

[*]: Interestingly enough, even though the spec sort of allows to select non-grouped fields, major engines seem to not really like it. Oracle and SQLServer just don't allow this at all. Mysql used to allow it by default, but now since 5.7 the administrator needs to enable this option (ONLY_FULL_GROUP_BY) manually in the server configuration for this feature to be supported...

Related Solutions

Mysql – Retrieving the last record in each group – MySQL

MySQL 8.0 now supports windowing functions, like almost all popular SQL implementations. With this standard syntax, we can write greatest-n-per-group queries:

WITH ranked_messages AS (
  SELECT m.*, ROW_NUMBER() OVER (PARTITION BY name ORDER BY id DESC) AS rn
  FROM messages AS m
)
SELECT * FROM ranked_messages WHERE rn = 1;

Below is the original answer I wrote for this question in 2009:

I write the solution this way:

SELECT m1.*
FROM messages m1 LEFT JOIN messages m2
 ON (m1.name = m2.name AND m1.id < m2.id)
WHERE m2.id IS NULL;

Regarding performance, one solution or the other can be better, depending on the nature of your data. So you should test both queries and use the one that is better at performance given your database.

For example, I have a copy of the StackOverflow August data dump. I'll use that for benchmarking. There are 1,114,357 rows in the Posts table. This is running on MySQL 5.0.75 on my Macbook Pro 2.40GHz.

I'll write a query to find the most recent post for a given user ID (mine).

First using the technique shown by @Eric with the GROUP BY in a subquery:

SELECT p1.postid
FROM Posts p1
INNER JOIN (SELECT pi.owneruserid, MAX(pi.postid) AS maxpostid
            FROM Posts pi GROUP BY pi.owneruserid) p2
  ON (p1.postid = p2.maxpostid)
WHERE p1.owneruserid = 20860;

1 row in set (1 min 17.89 sec)

Even the EXPLAIN analysis takes over 16 seconds:

+----+-------------+------------+--------+----------------------------+-------------+---------+--------------+---------+-------------+
| id | select_type | table      | type   | possible_keys              | key         | key_len | ref          | rows    | Extra       |
+----+-------------+------------+--------+----------------------------+-------------+---------+--------------+---------+-------------+
|  1 | PRIMARY     | <derived2> | ALL    | NULL                       | NULL        | NULL    | NULL         |   76756 |             | 
|  1 | PRIMARY     | p1         | eq_ref | PRIMARY,PostId,OwnerUserId | PRIMARY     | 8       | p2.maxpostid |       1 | Using where | 
|  2 | DERIVED     | pi         | index  | NULL                       | OwnerUserId | 8       | NULL         | 1151268 | Using index | 
+----+-------------+------------+--------+----------------------------+-------------+---------+--------------+---------+-------------+
3 rows in set (16.09 sec)

Now produce the same query result using my technique with LEFT JOIN:

SELECT p1.postid
FROM Posts p1 LEFT JOIN posts p2
  ON (p1.owneruserid = p2.owneruserid AND p1.postid < p2.postid)
WHERE p2.postid IS NULL AND p1.owneruserid = 20860;

1 row in set (0.28 sec)

The EXPLAIN analysis shows that both tables are able to use their indexes:

+----+-------------+-------+------+----------------------------+-------------+---------+-------+------+--------------------------------------+
| id | select_type | table | type | possible_keys              | key         | key_len | ref   | rows | Extra                                |
+----+-------------+-------+------+----------------------------+-------------+---------+-------+------+--------------------------------------+
|  1 | SIMPLE      | p1    | ref  | OwnerUserId                | OwnerUserId | 8       | const | 1384 | Using index                          | 
|  1 | SIMPLE      | p2    | ref  | PRIMARY,PostId,OwnerUserId | OwnerUserId | 8       | const | 1384 | Using where; Using index; Not exists | 
+----+-------------+-------+------+----------------------------+-------------+---------+-------+------+--------------------------------------+
2 rows in set (0.00 sec)

Here's the DDL for my Posts table:

CREATE TABLE `posts` (
  `PostId` bigint(20) unsigned NOT NULL auto_increment,
  `PostTypeId` bigint(20) unsigned NOT NULL,
  `AcceptedAnswerId` bigint(20) unsigned default NULL,
  `ParentId` bigint(20) unsigned default NULL,
  `CreationDate` datetime NOT NULL,
  `Score` int(11) NOT NULL default '0',
  `ViewCount` int(11) NOT NULL default '0',
  `Body` text NOT NULL,
  `OwnerUserId` bigint(20) unsigned NOT NULL,
  `OwnerDisplayName` varchar(40) default NULL,
  `LastEditorUserId` bigint(20) unsigned default NULL,
  `LastEditDate` datetime default NULL,
  `LastActivityDate` datetime default NULL,
  `Title` varchar(250) NOT NULL default '',
  `Tags` varchar(150) NOT NULL default '',
  `AnswerCount` int(11) NOT NULL default '0',
  `CommentCount` int(11) NOT NULL default '0',
  `FavoriteCount` int(11) NOT NULL default '0',
  `ClosedDate` datetime default NULL,
  PRIMARY KEY  (`PostId`),
  UNIQUE KEY `PostId` (`PostId`),
  KEY `PostTypeId` (`PostTypeId`),
  KEY `AcceptedAnswerId` (`AcceptedAnswerId`),
  KEY `OwnerUserId` (`OwnerUserId`),
  KEY `LastEditorUserId` (`LastEditorUserId`),
  KEY `ParentId` (`ParentId`),
  CONSTRAINT `posts_ibfk_1` FOREIGN KEY (`PostTypeId`) REFERENCES `posttypes` (`PostTypeId`)
) ENGINE=InnoDB;

Note to commenters: If you want another benchmark with a different version of MySQL, a different dataset, or different table design, feel free to do it yourself. I have shown the technique above. Stack Overflow is here to show you how to do software development work, not to do all the work for you.

Best Answer

Related Solutions

Mysql – Retrieving the last record in each group – MySQL

Related Topic