Select first row in each GROUP BY group?

Question

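As the title suggests, I'd like to select the first row of each set of rows grouped with a GROUP BY.

Specifically, if I've got a purchases table that looks like this: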

SELECT * FROM purchases;

My output:

id | customer | total
---+----------+------
 1 | Joe      | 5
 2 | Sally    | 3
 3 | Joe      | 2
 4 | Sally    | 1

I'd like to query for the id of the largest purchase (total) made by each customer. Something like this:

SELECT FIRST(id), customer, FIRST(total)
FROM  purchases
GROUP BY customer
ORDER BY total DESC;

Expected output:

FIRST(id) | customer | FIRST(total)
----------+----------+-------------
        1 | Joe      | 5
        2 | Sally    | 3

Answer 1

In PostgreSQL, this is typically simpler and faster (more performance optimization below):

SELECT DISTINCT ON (customer)
       id, customer, total
FROM   purchases
ORDER  BY customer, total DESC, id;

Or shorter (if not as clear) with ordinal numbers of output columns:

SELECT DISTINCT ON (2)
       id, customer, total
FROM   purchases
ORDER  BY 2, 3 DESC, 1;

If total can be NULL (won't hurt either way, but you'll want to match existing indexes):

...
ORDER  BY customer, total DESC NULLS LAST, id;

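Major points

DISTINCT ON is a PostgreSQL extension of the standard (where only DISTINCT on the whole SELECT list is defined).
List any number of expressions in the DISTINCT ON clause; the combined row value defines duplicates. The manual:

Obviously, two rows are considered distinct if they differ in at least one column value. Null values are considered equal in this comparison.

Bold emphasis mine.
DISTINCT ON can be combined with ORDER BY. Leading expressions in ORDER BY have to match the leading DISTINCT ON expressions in the same order. You can add additional expressions to ORDER BY to pick a particular row from each group of peers. I added id as the last item to break ties:

"Pick the row with the smallest id from each group sharing the highest total."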
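The tie-breaking rule can be checked without a PostgreSQL server. The following is only a sketch that emulates "sort, then keep the first row per customer" in plain Python; it is not how PostgreSQL executes DISTINCT ON. The row with id 5 is a hypothetical addition to the question's data, included to create a tie.

```python
from itertools import groupby

# Sample rows from the question, plus one hypothetical tie (id 5) for Joe.
purchases = [
    (1, "Joe", 5),
    (2, "Sally", 3),
    (3, "Joe", 2),
    (4, "Sally", 1),
    (5, "Joe", 5),  # assumption: added only to show tie-breaking
]

# Mirror ORDER BY customer, total DESC, id.
ordered = sorted(purchases, key=lambda r: (r[1], -r[2], r[0]))

# Mirror DISTINCT ON (customer): keep the first row of each customer group.
first_rows = [next(group) for _, group in groupby(ordered, key=lambda r: r[1])]

print(first_rows)
```

Joe has two purchases with total 5; the trailing id in the sort key is what makes the choice between them deterministic.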

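To order results in a way that disagrees with the sort order determining the first row per group, you can nest the above query in an outer query with another ORDER BY. Like: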
- PostgreSQL DISTINCT ON with different ORDER BY
If total can be NULL, you most probably want the row with the greatest non-null value. Add NULLS LAST as demonstrated. Details:
- PostgreSQL sort by datetime asc, null first?
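The SELECT list is not constrained by expressions in DISTINCT ON or ORDER BY in any way. (Not needed in the simple case above):
- You don't have to include any of the expressions in DISTINCT ON or ORDER BY.
- You can include any other expression in the SELECT list. This is instrumental for replacing much more complex queries with subqueries and aggregate / window functions.
I tested with Postgres versions 8.3 - 12. But the feature has been there at least since version 7.1, so basically always.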

Index

The perfect index for the above query would be a multi-column index spanning all three columns in matching sequence and with matching sort order:

CREATE INDEX purchases_3c_idx ON purchases (customer, total DESC, id);

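May be too specialized. But use it if read performance for the particular query is crucial. If you have DESC NULLS LAST in the query, use the same in the index so that the sort order matches and the index is applicable.

Effectiveness / Performance optimization

Weigh cost and benefit before creating tailored indexes for each query. The potential of the above index largely depends on data distribution.

The index is used because it delivers pre-sorted data. In Postgres 9.2 or later the query can also benefit from an index only scan if the index is smaller than the underlying table. The index has to be scanned in its entirety, though.

For few rows per customer (high cardinality in column customer), this is very efficient. Even more so if you need sorted output anyway. The benefit shrinks with a growing number of rows per customer.

Ideally, you have enough work_mem to process the involved sort step in RAM and not spill to disk. But generally setting work_mem too high can have adverse effects. Consider SET LOCAL for exceptionally big queries. Find how much you need with EXPLAIN ANALYZE. Mention of "Disk:" in the sort step indicates the need for more: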
- Configuration parameter work_mem in PostgreSQL on Linux
- Optimize simple query using ORDER BY date and text
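For many rows per customer (low cardinality in column customer), a loose index scan (a.k.a. "skip scan") would be (much) more efficient, but that's not implemented up to Postgres 11. (An implementation for index-only scans is in development for Postgres 13. See here and here.)
For now, there are faster query techniques to substitute for this. In particular if you have a separate table holding unique customers, which is the typical use case. But also if you don't:
- Optimize GROUP BY query to retrieve latest row per user

Benchmark

I had a simple benchmark here which is outdated by now. I replaced it with a detailed benchmark in this separate answer.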

Answer 2

On Oracle 9.2+ (not 8i+ as originally stated), SQL Server 2005+, PostgreSQL 8.4+, DB2, Firebird 3.0+, Teradata, Sybase, Vertica:

WITH summary AS (
    SELECT p.id, 
           p.customer, 
           p.total, 
           ROW_NUMBER() OVER(PARTITION BY p.customer 
                                 ORDER BY p.total DESC) AS rk
      FROM PURCHASES p)
SELECT s.*
  FROM summary s
 WHERE s.rk = 1

Supported by any database, but you need to add logic to break ties:

  SELECT MIN(x.id),  -- change to MAX if you want the highest
         x.customer, 
         x.total
    FROM PURCHASES x
    JOIN (SELECT p.customer,
                 MAX(total) AS max_total
            FROM PURCHASES p
        GROUP BY p.customer) y ON y.customer = x.customer
                              AND y.max_total = x.total
GROUP BY x.customer, x.total
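To see that the two queries above agree on the question's sample data, here is a small sketch that runs both through Python's built-in sqlite3 module. SQLite is only a stand-in for the databases listed in this answer, and the window-function query assumes the bundled SQLite is version 3.25 or later. The trailing ORDER BY clauses are additions that make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE purchases (id INTEGER PRIMARY KEY, customer TEXT, total INTEGER);
    INSERT INTO purchases VALUES
        (1, 'Joe', 5), (2, 'Sally', 3), (3, 'Joe', 2), (4, 'Sally', 1);
""")

# Variant 1: number the rows per customer, best total first, keep row 1.
window_sql = """
    WITH summary AS (
        SELECT p.id, p.customer, p.total,
               ROW_NUMBER() OVER (PARTITION BY p.customer
                                  ORDER BY p.total DESC) AS rk
        FROM purchases p)
    SELECT s.id, s.customer, s.total
    FROM summary s
    WHERE s.rk = 1
    ORDER BY s.customer
"""

# Variant 2: join back to the per-customer maximum, MIN(id) breaks ties.
join_sql = """
    SELECT MIN(x.id), x.customer, x.total
    FROM purchases x
    JOIN (SELECT p.customer, MAX(p.total) AS max_total
          FROM purchases p
          GROUP BY p.customer) y
      ON y.customer = x.customer AND y.max_total = x.total
    GROUP BY x.customer, x.total
    ORDER BY x.customer
"""

window_rows = conn.execute(window_sql).fetchall()
join_rows = conn.execute(join_sql).fetchall()
print(window_rows)
print(join_rows)
```

With ties on total, the first variant picks an arbitrary row among the tied ones unless id is added to its ORDER BY, while the second always returns the lowest id.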

Answer 3

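Benchmark

Testing the most interesting candidates with Postgres 9.4 and 9.5, using a halfway realistic table of 200k rows in purchases and 10k distinct customer_id (avg. 20 rows per customer).

For Postgres 9.5 I ran a second test with effectively 86446 distinct customers. See below (avg. 2.3 rows per customer).

Setup

Main table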

CREATE TABLE purchases (
  id          serial
, customer_id int  -- REFERENCES customer
, total       int  -- could be amount of money in Cent
, some_column text -- to make the row bigger, more realistic
);

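I use a serial (PK constraint added below) and an integer customer_id since that's a more typical setup. Also added some_column to make up for typically more columns.

Dummy data, PK, index - a typical table also has some dead tuples: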

INSERT INTO purchases (customer_id, total, some_column)    -- insert 200k rows
SELECT (random() * 10000)::int             AS customer_id  -- 10k customers
     , (random() * random() * 100000)::int AS total     
     , 'note: ' || repeat('x', (random()^2 * random() * random() * 500)::int)
FROM   generate_series(1,200000) g;

ALTER TABLE purchases ADD CONSTRAINT purchases_id_pkey PRIMARY KEY (id);

DELETE FROM purchases WHERE random() > 0.9; -- some dead rows

INSERT INTO purchases (customer_id, total, some_column)
SELECT (random() * 10000)::int             AS customer_id  -- 10k customers
     , (random() * random() * 100000)::int AS total     
     , 'note: ' || repeat('x', (random()^2 * random() * random() * 500)::int)
FROM   generate_series(1,20000) g;  -- add 20k to make it ~ 200k

CREATE INDEX purchases_3c_idx ON purchases (customer_id, total DESC, id);

VACUUM ANALYZE purchases;

customer table - for the superior query

CREATE TABLE customer AS
SELECT customer_id, 'customer_' || customer_id AS customer
FROM   purchases
GROUP  BY 1
ORDER  BY 1;

ALTER TABLE customer ADD CONSTRAINT customer_customer_id_pkey PRIMARY KEY (customer_id);

VACUUM ANALYZE customer;

In my second test for 9.5 I used the same setup, but with random() * 100000 to generate customer_id, to get only a few rows per customer_id.

Object sizes for table `purchases`

Generated with this query.

               what                | bytes/ct | bytes_pretty | bytes_per_row
-----------------------------------+----------+--------------+---------------
 core_relation_size                | 20496384 | 20 MB        |           102
 visibility_map                    |        0 | 0 bytes      |             0
 free_space_map                    |    24576 | 24 kB        |             0
 table_size_incl_toast             | 20529152 | 20 MB        |           102
 indexes_size                      | 10977280 | 10 MB        |            54
 total_size_incl_toast_and_indexes | 31506432 | 30 MB        |           157
 live_rows_in_text_representation  | 13729802 | 13 MB        |            68
 ------------------------------    |          |              |
 row_count                         |   200045 |              |
 live_tuples                       |   200045 |              |
 dead_tuples                       |    19955 |              |

Queries

1. `row_number()` in CTE (see other answer)

WITH cte AS (
   SELECT id, customer_id, total
        , row_number() OVER(PARTITION BY customer_id ORDER BY total DESC) AS rn
   FROM   purchases
   )
SELECT id, customer_id, total
FROM   cte
WHERE  rn = 1;

2. `row_number()` in subquery (my optimization)

SELECT id, customer_id, total
FROM   (
   SELECT id, customer_id, total
        , row_number() OVER(PARTITION BY customer_id ORDER BY total DESC) AS rn
   FROM   purchases
   ) sub
WHERE  rn = 1;

3. `DISTINCT ON` (see other answer)

SELECT DISTINCT ON (customer_id)
       id, customer_id, total
FROM   purchases
ORDER  BY customer_id, total DESC, id;

4. rCTE with `LATERAL` subquery (see here)

WITH RECURSIVE cte AS (
   (  -- parentheses required
   SELECT id, customer_id, total
   FROM   purchases
   ORDER  BY customer_id, total DESC
   LIMIT  1
   )
   UNION ALL
   SELECT u.*
   FROM   cte c
   ,      LATERAL (
      SELECT id, customer_id, total
      FROM   purchases
      WHERE  customer_id > c.customer_id  -- lateral reference
      ORDER  BY customer_id, total DESC
      LIMIT  1
      ) u
   )
SELECT id, customer_id, total
FROM   cte
ORDER  BY customer_id;

5. `customer` table with `LATERAL` (see here)

SELECT l.*
FROM   customer c
,      LATERAL (
   SELECT id, customer_id, total
   FROM   purchases
   WHERE  customer_id = c.customer_id  -- lateral reference
   ORDER  BY total DESC
   LIMIT  1
   ) l;

6. `array_agg()` with `ORDER BY` (see other answer)

SELECT (array_agg(id ORDER BY total DESC))[1] AS id
     , customer_id
     , max(total) AS total
FROM   purchases
GROUP  BY customer_id;

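Results

Execution time for the above queries with EXPLAIN ANALYZE (and all options off), best of 5 runs.

All queries used an Index Only Scan on purchases2_3c_idx (among other steps). Some of them just for the smaller size of the index, others more effectively.

A. Postgres 9.4 with 200k rows and ~ 20 per `customer_id`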

1. 273.274 ms  
2. 194.572 ms  
3. 111.067 ms  
4.  92.922 ms  
5.  37.679 ms  -- winner
6. 189.495 ms

B. The same with Postgres 9.5

1. 288.006 ms
2. 223.032 ms  
3. 107.074 ms  
4.  78.032 ms  
5.  33.944 ms  -- winner
6. 211.540 ms

C. Same as B., but with ~ 2.3 rows per `customer_id`

1. 381.573 ms
2. 311.976 ms
3. 124.074 ms  -- winner
4. 710.631 ms
5. 311.976 ms
6. 421.679 ms

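Original (outdated) benchmark from 2011

I ran three tests with PostgreSQL 9.1 on a real-life table of 65579 rows with single-column btree indexes on each of the three columns involved, and took the best execution time of 5 runs. Comparing @OMGPonies' first query (A) to the above DISTINCT ON solution (B):

1. Select the whole table, which results in 5958 rows in this case.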
```
A: 567.218 ms
B: 386.673 ms
```
2. Use the condition WHERE customer BETWEEN x AND y, resulting in 1000 rows.
```
A: 249.136 ms
B:  55.111 ms
```
3. Select a single customer with WHERE customer = x.
```
A:   0.143 ms
B:   0.072 ms
```

Same tests repeated with the index described in the other answer

CREATE INDEX purchases_3c_idx ON purchases (customer, total DESC, id);

1A: 277.953 ms  
1B: 193.547 ms

2A: 249.796 ms -- special index not used  
2B:  28.679 ms

3A:   0.120 ms  
3B:   0.048 ms

Answer 4

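This is the common greatest-n-per-group problem, which already has well tested and highly optimized solutions. Personally I prefer the left join solution by Bill Karwin (the original post with lots of other solutions).

Note that a bunch of solutions to this common problem can, surprisingly, be found in one of the most official sources, the MySQL manual! See Examples of Common Queries :: The Rows Holding the Group-wise Maximum of a Certain Column.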

Answer 5

In Postgres you can use array_agg like this:

SELECT  customer,
        (array_agg(id ORDER BY total DESC))[1],
        max(total)
FROM purchases
GROUP BY customer

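This will give you the id of each customer's largest purchase.

Some things to note:

- array_agg is an aggregate function, so it works with GROUP BY.
- array_agg lets you specify an ordering scoped to just itself, so it doesn't constrain the structure of the whole query. There is also syntax for how NULLs are sorted, if you need something different from the default.
- Once we build the array, we take the first element. (Postgres arrays are 1-indexed, not 0-indexed.)
- You could use array_agg in a similar way for the third output column, but max(total) is simpler.
- Unlike DISTINCT ON, using array_agg lets you keep your GROUP BY, in case you want that for other reasons.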

Answer 6

The solution is not very efficient, as pointed out by Erwin, because of the presence of SubQs

select * from purchases p1 where total in
(select max(total) from purchases where p1.customer=customer) order by total desc;
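One property of this query worth seeing on data: when a customer has several purchases sharing the maximum total, all of them are returned. The sketch below runs the query through Python's sqlite3 module as a stand-in database; the row with id 5 is a hypothetical tie added for illustration, and id is appended to ORDER BY only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE purchases (id INTEGER PRIMARY KEY, customer TEXT, total INTEGER);
    -- Rows from the question plus a hypothetical tie (id 5) for Joe.
    INSERT INTO purchases VALUES
        (1, 'Joe', 5), (2, 'Sally', 3), (3, 'Joe', 2), (4, 'Sally', 1), (5, 'Joe', 5);
""")

# Correlated subquery: keep rows whose total equals their customer's maximum.
rows = conn.execute("""
    SELECT * FROM purchases p1
    WHERE total IN (SELECT max(total) FROM purchases WHERE p1.customer = customer)
    ORDER BY total DESC, id
""").fetchall()
print(rows)
```

Both of Joe's tied rows come back, so an extra tie-breaking step is needed if exactly one row per customer is required.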

Answer 7

I use this way (PostgreSQL only): https://wiki.postgresql.org/wiki/First/last_%28aggregate%29

-- Create a function that always returns the first non-NULL item
CREATE OR REPLACE FUNCTION public.first_agg ( anyelement, anyelement )
RETURNS anyelement LANGUAGE sql IMMUTABLE STRICT AS $$
        SELECT $1;
$$;

-- And then wrap an aggregate around it
CREATE AGGREGATE public.first (
        sfunc    = public.first_agg,
        basetype = anyelement,
        stype    = anyelement
);

-- Create a function that always returns the last non-NULL item
CREATE OR REPLACE FUNCTION public.last_agg ( anyelement, anyelement )
RETURNS anyelement LANGUAGE sql IMMUTABLE STRICT AS $$
        SELECT $2;
$$;

-- And then wrap an aggregate around it
CREATE AGGREGATE public.last (
        sfunc    = public.last_agg,
        basetype = anyelement,
        stype    = anyelement
);

Then your example should work almost as is:

SELECT FIRST(id), customer, FIRST(total)
FROM  purchases
GROUP BY customer
ORDER BY FIRST(total) DESC;

CAVEAT: It ignores NULL rows

Edit 1 - Use the postgres extension instead

Now I use this way: http://pgxn.org/dist/first_last_agg/

To install on Ubuntu 14.04:

apt-get install postgresql-server-dev-9.3 git build-essential -y
git clone git://github.com/wulczer/first_last_agg.git
cd first_last_agg
make && sudo make install
psql -c 'create extension first_last_agg'

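It's a postgres extension that gives you first and last functions; apparently faster than the above way.

Edit 2 - Ordering and filtering

If you use aggregate functions (like these), you can order the results without the need to have the data already ordered:

http://www.postgresql.org/docs/current/static/sql-expressions.html#SYNTAX-AGGREGATES

So the equivalent example, with ordering, would be something like: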

SELECT first(id order by id), customer, first(total order by id)
  FROM purchases
 GROUP BY customer
 ORDER BY first(total);

Of course you can order and filter as you deem fit within the aggregate; it's very powerful syntax.

Answer 8

Very fast solution

SELECT a.* 
FROM
    purchases a 
    JOIN ( 
        SELECT customer, min( id ) as id 
        FROM purchases 
        GROUP BY customer 
    ) b USING ( id );

and really very fast if the table is indexed by id:

create index purchases_id on purchases (id);
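It helps to check what "first" means for this query: it keeps the row with the lowest id per customer, which is not necessarily the largest total. A small sketch with Python's sqlite3 module as a stand-in database follows; the data is hypothetical, chosen so that each customer's first purchase is not always their biggest, and the ORDER BY is an addition for predictable output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE purchases (id INTEGER PRIMARY KEY, customer TEXT, total INTEGER);
    -- Hypothetical data: Joe's largest purchase (id 3) is not his first one.
    INSERT INTO purchases VALUES
        (1, 'Joe', 2), (2, 'Sally', 3), (3, 'Joe', 5), (4, 'Sally', 1);
""")

# Join each customer's lowest id back to the full row.
rows = conn.execute("""
    SELECT a.*
    FROM purchases a
    JOIN (
        SELECT customer, min(id) AS id
        FROM purchases
        GROUP BY customer
    ) b USING (id)
    ORDER BY a.id
""").fetchall()
print(rows)
```

Joe's row with total 2 is returned, so this approach fits "earliest row per group" rather than "largest total per group".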

Answer 9

The query:

SELECT purchases.*
FROM purchases
LEFT JOIN purchases as p 
ON 
  p.customer = purchases.customer 
  AND 
  purchases.total < p.total
WHERE p.total IS NULL

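How does that work! (I've been there)

We want to make sure that we only have the highest total for each purchase.

Some theoretical stuff (skip this part if you only want to understand the query)

Let Total be a function T(customer, id) that returns a value given the name and id. To prove that the given total (T(customer, id)) is the highest, we have to prove either

∀x T(customer, id) > T(customer, x) (this total is higher than all other totals for that customer)

OR

¬∃x T(customer, id) < T(customer, x) (there exists no higher total for that customer)

The first approach needs us to get all the records for that name, which I do not really like.

The second one needs a smart way to say there can be no record higher than this one.

Back to SQL

If we left join the table on the name and on the total being less than that of the joined table: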

      LEFT JOIN purchases as p 
      ON 
      p.customer = purchases.customer 
      AND 
      purchases.total < p.total

we make sure that all records that have another record with a higher total for the same user get joined:

purchases.id, purchases.customer, purchases.total, p.id, p.customer, p.total
1           , Tom           , 200             , 2   , Tom   , 300
2           , Tom           , 300
3           , Bob           , 400             , 4   , Bob   , 500
4           , Bob           , 500
5           , Alice         , 600             , 6   , Alice   , 700
6           , Alice         , 700

That will help us filter for the highest total for each purchase with no grouping needed:

WHERE p.total IS NULL

purchases.id, purchases.customer, purchases.total, p.id, p.customer, p.total
2           , Tom           , 300
4           , Bob           , 500
6           , Alice         , 700

And that's the answer we need.
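The walkthrough above can be reproduced with Python's sqlite3 module as a stand-in database. This is a sketch using the Tom / Bob / Alice rows from the tables in this answer; the ORDER BY is an addition for predictable output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE purchases (id INTEGER PRIMARY KEY, customer TEXT, total INTEGER);
    INSERT INTO purchases VALUES
        (1, 'Tom', 200), (2, 'Tom', 300),
        (3, 'Bob', 400), (4, 'Bob', 500),
        (5, 'Alice', 600), (6, 'Alice', 700);
""")

# Anti-join: a row survives only if no row of the same customer has a higher total.
rows = conn.execute("""
    SELECT purchases.*
    FROM purchases
    LEFT JOIN purchases AS p
      ON p.customer = purchases.customer
     AND purchases.total < p.total
    WHERE p.total IS NULL
    ORDER BY purchases.id
""").fetchall()
print(rows)
```

Because the join condition uses a strict `<`, two rows tied on the highest total would both survive the filter.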

Answer 10

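The accepted OMG Ponies' "Supported by any database" solution has good speed in my tests.

Here I provide a same-approach, but more complete and clean any-database solution. Ties are considered (assuming you want only one row per customer, even when several records share the max total for that customer), and other purchase fields (e.g. purchase_payment_id) will be selected from the actual matching rows in the purchase table.

Supported by any database: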

select * from purchase
join (
    select min(id) as id from purchase
    join (
        select customer, max(total) as total from purchase
        group by customer
    ) t1 using (customer, total)
    group by customer
) t2 using (id)
order by customer

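This query is reasonably fast, especially when there is a composite index like (customer, total) on the purchase table.

Remark:

t1, t2 are subquery aliases which could be removed depending on the database.
Caveat: the using (...) clause is not supported in MS-SQL and Oracle db as of this edit in January 2017. You have to expand it yourself to e.g. on t2.id = purchase.id etc. The USING syntax works in SQLite, MySQL and PostgreSQL.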
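Since the remark says the USING syntax works in SQLite, the query can be tried directly with Python's sqlite3 module. This is a sketch on the question's sample rows; the row with id 5 is a hypothetical tie added to show that only one row per customer comes back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE purchase (id INTEGER PRIMARY KEY, customer TEXT, total INTEGER);
    -- Rows from the question plus a hypothetical tie (id 5) for Joe.
    INSERT INTO purchase VALUES
        (1, 'Joe', 5), (2, 'Sally', 3), (3, 'Joe', 2), (4, 'Sally', 1), (5, 'Joe', 5);
""")

# t1: max total per customer; t2: lowest id among rows holding that max.
rows = conn.execute("""
    select * from purchase
    join (
        select min(id) as id from purchase
        join (
            select customer, max(total) as total from purchase
            group by customer
        ) t1 using (customer, total)
        group by customer
    ) t2 using (id)
    order by customer
""").fetchall()
print(rows)
```

Joe's two tied purchases collapse to the one with the lowest id.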

Answer 11

In SQL Server you can do this:

SELECT *
FROM (
SELECT ROW_NUMBER()
OVER(PARTITION BY customer
ORDER BY total DESC) AS StRank, *
FROM Purchases) n
WHERE StRank = 1

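Explanation: the rows are grouped (partitioned) by customer and ordered by total within each group; each row in a group is given a serial number as StRank, and we keep the row whose StRank is 1 for each customer.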

Answer 12

Use the ARRAY_AGG function for PostgreSQL, U-SQL, IBM DB2, and Google BigQuery SQL:

SELECT customer, (ARRAY_AGG(id ORDER BY total DESC))[1], MAX(total)
FROM purchases
GROUP BY customer

Answer 13

In PostgreSQL, another possibility is to use the first_value window function in combination with SELECT DISTINCT:

select distinct customer_id,
                first_value(row(id, total)) over(partition by customer_id order by total desc, id)
from            purchases;

I created a composite (id, total), so both values are returned by the same aggregate. You can of course always apply first_value() twice.
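The "apply first_value() twice" variant avoids the composite row type and therefore also runs on SQLite. Here is a sketch with Python's sqlite3 module, assuming the bundled SQLite is version 3.25 or later (window function support). The customer_id values are hypothetical stand-ins for the question's customers, and the ORDER BY is an addition for predictable output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE purchases (id INTEGER PRIMARY KEY, customer_id INTEGER, total INTEGER);
    -- Hypothetical ids: customer 10 stands in for Joe, 20 for Sally.
    INSERT INTO purchases VALUES (1, 10, 5), (2, 20, 3), (3, 10, 2), (4, 20, 1);
""")

# Every row of a partition gets the same first_value results; DISTINCT folds them.
rows = conn.execute("""
    select distinct customer_id,
           first_value(id)    over (partition by customer_id order by total desc, id) as id,
           first_value(total) over (partition by customer_id order by total desc, id) as total
    from   purchases
    order  by customer_id
""").fetchall()
print(rows)
```

Both window definitions must use the same ordering, otherwise id and total could come from different rows.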

Answer 14

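If you want to select any row (by some specific condition of yours) from the set of aggregated rows.
If you want to use another aggregate function (sum/avg) in addition to max/min, so that DISTINCT ON does not help.

You can use the following subquery: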

SELECT  
    (  
       SELECT id FROM t2   
       WHERE id = ANY ( ARRAY_AGG( tf.id ) ) AND amount = MAX( tf.amount )   
    ) id,  
    name,   
    MAX(amount) ma,  
    SUM( ratio )  
FROM t2  tf  
GROUP BY name

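You can replace amount = MAX( tf.amount ) with any condition you need, with one restriction: this subquery must not return more than one row.

But if you want to do such things, you are probably looking for window functions.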

Answer 15

For SQL Server the most efficient way is:

with
ids as ( --condition for split table into groups
    select i from (values (9),(12),(17),(18),(19),(20),(22),(21),(23),(10)) as v(i) 
) 
,src as ( 
    select * from yourTable where  <condition> --use this as filter for other conditions
)
,joined as (
    select tops.* from ids 
    cross apply --it`s like for each rows
    (
        select top(1) * 
        from src
        where CommodityId = ids.i 
    ) as tops
)
select * from joined

And don't forget to create a clustered index for the columns used.

Answer 16

Snowflake / Teradata support the QUALIFY clause, which works like HAVING for window functions.

Answer 17

This way works for me:

SELECT article, dealer, price
FROM   shop s1
WHERE  price=(SELECT MAX(s2.price)
              FROM shop s2
              WHERE s1.article = s2.article
              GROUP BY s2.article)
ORDER BY article;

Select the highest price for each article.
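A quick way to try the query is Python's sqlite3 module as a stand-in database. The shop rows below are hypothetical sample data, with prices stored as integer cents to keep the equality comparison exact.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE shop (article INTEGER, dealer TEXT, price INTEGER);
    -- Hypothetical sample data, prices in cents.
    INSERT INTO shop VALUES
        (1, 'A', 345), (1, 'B', 399), (2, 'A', 1099),
        (3, 'B', 145), (3, 'C', 169), (3, 'D', 125), (4, 'D', 1995);
""")

# For each row, compare its price with the maximum price of the same article.
rows = conn.execute("""
    SELECT article, dealer, price
    FROM   shop s1
    WHERE  price = (SELECT MAX(s2.price)
                    FROM shop s2
                    WHERE s1.article = s2.article
                    GROUP BY s2.article)
    ORDER BY article
""").fetchall()
print(rows)
```

The inner GROUP BY is harmless here because the subquery is already restricted to a single article.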

Answer 18

My approach via window functions (dbfiddle):

Assign row_number within each group: row_number() over (partition by agreement_id, order_id ) as nrow
Take only the first row of each group: filter (where nrow = 1)
