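Counting rows in a partition with ORDER BY

Date: 2018-06-20 10:00:56

Tags: sql postgresql window-functions

I am trying to understand PARTITION BY in Postgres by writing a few sample queries. I have a test table that I run the queries against.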

id integer | num integer
___________|_____________
1          | 4 
2          | 4
3          | 5
4          | 6

When I run the following query, I get the output I expect.

SELECT id, COUNT(id) OVER(PARTITION BY num) from test;

id         | count
___________|_____________
1          | 2 
2          | 2
3          | 1
4          | 1

However, when I add ORDER BY to the partition,

SELECT id, COUNT(id) OVER(PARTITION BY num ORDER BY id) from test;

id         | count
___________|_____________
1          | 1 
2          | 2
3          | 1
4          | 1

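My understanding is that COUNT is computed over all rows that belong to a partition. Here I have partitioned the rows by num. The number of rows in each partition is the same with or without the ORDER BY clause, so why do the outputs differ?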

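3 answers:

Answer 0: (score: 5)

When you add an order by to an aggregate used as a window function, that aggregate turns into a "running count" (or a running version of whatever aggregate you use).

count(*) will return the number of rows up to the "current row" according to the specified order (rows that tie with the current row on the ordering key are included as well).

The following query shows the different results for aggregates used with an order by. With sum() instead of count() the effect is (in my opinion) a bit easier to see.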

with test (id, num, x) as (
  values 
    (1, 4, 1),
    (2, 4, 1),
    (3, 5, 2),
    (4, 6, 2)
)
select id, 
       num,
       x,
       count(*) over () as total_rows, 
       count(*) over (order by id) as rows_upto,
       count(*) over (partition by x order by id) as rows_per_x,
       sum(num) over (partition by x) as total_for_x,
       sum(num) over (order by id) as sum_upto,
       sum(num) over (partition by x order by id) as sum_for_x_upto
from test;

This results in:

id | num | x | total_rows | rows_upto | rows_per_x | total_for_x | sum_upto | sum_for_x_upto
---+-----+---+------------+-----------+------------+-------------+----------+---------------
 1 |   4 | 1 |          4 |         1 |          1 |           8 |        4 |              4
 2 |   4 | 1 |          4 |         2 |          2 |           8 |        8 |              8
 3 |   5 | 2 |          4 |         3 |          1 |          11 |       13 |              5
 4 |   6 | 2 |          4 |         4 |          2 |          11 |       19 |             11

There are more examples in the Postgres manual.

Answer 1: (score: 2)

Your two expressions are:

COUNT(id) OVER (PARTITION BY num)

COUNT(id) OVER (PARTITION BY num ORDER BY id)

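Why would you expect them to return the same value? The syntax is different for a reason.

The first returns the total count for each num, essentially joining the aggregated value back onto every row.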
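To make the "joining the aggregate back" reading concrete, here is a minimal sketch that rewrites the first expression as an explicit GROUP BY plus join. It assumes the question's test table, rebuilt here as a CTE; the names per_num and cnt are made up for illustration. One caveat: the join would drop rows where num is NULL, whereas the window version keeps them.

```sql
with test (id, num) as (
  values (1, 4), (2, 4), (3, 5), (4, 6)
),
per_num as (
  -- one row per num, carrying the total count for that num
  select num, count(id) as cnt
  from test
  group by num
)
select t.id, p.cnt
from test t
join per_num p on p.num = t.num   -- attach the per-num total to every row
order by t.id;
```

For this data the result should match the question's first output: ids 1 and 2 get 2, ids 3 and 4 get 1.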

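The second does a cumulative count. For each row, it counts the id values of the rows in the partition up to and including the current row, in id order.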

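Note that such a cumulative count is often implemented using RANK() (or a related function), although a cumulative count is subtly different from RANK(). The cumulative count implements: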

COUNT(id) OVER (PARTITION BY num ORDER BY id RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)

RANK() does something slightly different. The difference only matters when the ORDER BY keys have ties.
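A small sketch of that difference, using made-up data (the table t and column k are not from the question) in which two rows tie on the ordering key:

```sql
with t (id, k) as (
  values (1, 10), (2, 10), (3, 20)
)
select id,
       k,
       -- default frame is RANGE ... CURRENT ROW, which includes all peers of the current row
       count(*) over (order by k) as running_count,
       -- rank() gives tied rows the position of the first row in the tie
       rank()   over (order by k) as rnk
from t
order by id;
```

For the two rows with k = 10, running_count is 2 while rnk is 1; for the row with k = 20 both are 3. Without ties the two columns would agree on every row.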

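Answer 2: (score: 1)

Others have already explained the "why". Sometimes you have an ordered window, but despite the ORDER BY you still need the count over the whole partition.

To do that, use an unbounded range: RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING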

create table search_log
(
    id bigint not null primary key,
    query varchar(255) not null,
    stemmed_query varchar(255) not null,
    created timestamp not null
);

SELECT query,
       created as seen_on,
       first_value(created) OVER query_window as last_seen,
       row_number() OVER query_window AS rn,
       count(*) OVER query_window AS occurence
FROM search_log l
     WINDOW query_window AS (PARTITION BY stemmed_query ORDER BY created DESC 
         RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING);
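
The same idea applied to the question's own data, as a minimal sketch (the test table is rebuilt as a CTE, and the aliases running_count and partition_count are made up for illustration). It puts the default frame and the unbounded frame side by side:

```sql
with test (id, num) as (
  values (1, 4), (2, 4), (3, 5), (4, 6)
)
select id,
       -- ORDER BY with the default frame: count up to the current row
       count(id) over (partition by num order by id) as running_count,
       -- same ordering, but the frame spans the whole partition
       count(id) over (partition by num order by id
                       range between unbounded preceding and unbounded following) as partition_count
from test
order by id;
```

Here running_count should come out as 1, 2, 1, 1 (the question's second result) and partition_count as 2, 2, 1, 1 (the question's first result), even though both windows carry the same ORDER BY.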