I am trying to understand PARTITION BY in Postgres by writing a few sample queries. I have a test table that I run my queries against.
id integer | num integer
___________|_____________
1 | 4
2 | 4
3 | 5
4 | 6
When I run the following query, I get the output I expect.
SELECT id, COUNT(id) OVER(PARTITION BY num) from test;
id | count
___________|_____________
1 | 2
2 | 2
3 | 1
4 | 1
However, when I add ORDER BY to the window definition, the output changes:
SELECT id, COUNT(id) OVER(PARTITION BY num ORDER BY id) from test;
id | count
___________|_____________
1 | 1
2 | 2
3 | 1
4 | 1
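My understanding is that COUNT is computed over all rows that belong to a partition. Here I have partitioned the rows by num. The number of rows in each partition is the same with or without the ORDER BY clause, so why do the outputs differ?
Answer 0 (score: 5)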
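When you add an ORDER BY to an aggregate that is used as a window function, that aggregate turns into a "running count" (or a running version of whatever aggregate you use).
count(*) then returns the number of rows up to the "current row", based on the specified order.
The following query shows the different results for aggregates used with an ORDER BY. With sum() instead of count() it is (in my opinion) a bit easier to see.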
with test (id, num, x) as (
values
(1, 4, 1),
(2, 4, 1),
(3, 5, 2),
(4, 6, 2)
)
select id,
num,
x,
count(*) over () as total_rows,
count(*) over (order by id) as rows_upto,
count(*) over (partition by x order by id) as rows_per_x,
sum(num) over (partition by x) as total_for_x,
sum(num) over (order by id) as sum_upto,
sum(num) over (partition by x order by id) as sum_for_x_upto
from test;
This results in:
id | num | x | total_rows | rows_upto | rows_per_x | total_for_x | sum_upto | sum_for_x_upto
---+-----+---+------------+-----------+------------+-------------+----------+---------------
1 | 4 | 1 | 4 | 1 | 1 | 8 | 4 | 4
2 | 4 | 1 | 4 | 2 | 2 | 8 | 8 | 8
3 | 5 | 2 | 4 | 3 | 1 | 11 | 13 | 5
4 | 6 | 2 | 4 | 4 | 2 | 11 | 19 | 11
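To make the difference concrete on the table from the question, here is a minimal sketch that runs both window expressions side by side. It assumes Python's bundled sqlite3 module (SQLite 3.25 or newer) as a stand-in for PostgreSQL; both engines apply the same default frame once ORDER BY is present. The column aliases per_partition and running are made up for illustration.

```python
import sqlite3

# Assumption: SQLite (3.25+) stands in for PostgreSQL here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (id INTEGER, num INTEGER)")
conn.executemany("INSERT INTO test VALUES (?, ?)", [(1, 4), (2, 4), (3, 5), (4, 6)])

rows = conn.execute(
    """
    SELECT id,
           -- no ORDER BY: the frame is the whole partition
           COUNT(id) OVER (PARTITION BY num)             AS per_partition,
           -- with ORDER BY: the frame ends at the current row
           COUNT(id) OVER (PARTITION BY num ORDER BY id) AS running
    FROM test
    ORDER BY id
    """
).fetchall()

for row in rows:
    print(row)  # (id, per_partition, running)
```

The second column reproduces the first output in the question and the third column reproduces the second one.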
Answer 1 (score: 2)
Your two expressions are:
COUNT(id) OVER (PARTITION BY num)
COUNT(id) OVER (PARTITION BY num ORDER BY id)
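Why would you expect them to return the same values? The syntax is different for a reason.
The first returns the total count for each num, essentially joining the aggregated value back onto each row.
The second does a cumulative count. For each row, it performs the COUNT() over the rows of its partition, in id order, up to and including that row's id.
Note that such cumulative counts are more typically implemented using RANK() (or a related function).
The two are not identical, though. The cumulative count is shorthand for: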
COUNT(id) OVER (PARTITION BY num ORDER BY id RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
RANK() is slightly different. The difference only matters when the ORDER BY keys have ties.
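A small sketch of the tie case, again assuming sqlite3 (SQLite 3.25 or newer) in place of PostgreSQL. The scores table and its data are hypothetical; two rows share the same ORDER BY key so the difference becomes visible. ROW_NUMBER() is included as one of the related functions.

```python
import sqlite3

# Assumption: SQLite (3.25+) stands in for PostgreSQL; "scores" is a made-up table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (id INTEGER, score INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?, ?)", [(1, 10), (2, 10), (3, 20)])

tie_rows = conn.execute(
    """
    SELECT id,
           -- RANGE frame: rows tied with the current row are counted too
           COUNT(*) OVER (ORDER BY score
                          RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cum_count,
           -- RANK: tied rows share the lowest position of the tie
           RANK() OVER (ORDER BY score) AS rnk,
           -- ROW_NUMBER: every row gets its own position (id breaks the tie)
           ROW_NUMBER() OVER (ORDER BY score, id) AS rn
    FROM scores
    ORDER BY id
    """
).fetchall()

for row in tie_rows:
    print(row)  # (id, cum_count, rnk, rn)
```

For the two tied rows the cumulative count is 2 while RANK() is 1; for rows without ties the two agree.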
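Answer 2 (score: 1)
Others have already explained the "why". Sometimes you have an ordered window, but despite the ORDER BY you still need to count over the whole partition.
To do that, use an unbounded frame with RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING: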
create table search_log
(
id bigint not null primary key,
query varchar(255) not null,
stemmed_query varchar(255) not null,
created timestamp not null
);
SELECT query,
created as seen_on,
first_value(created) OVER query_window as last_seen,
row_number() OVER query_window AS rn,
count(*) OVER query_window AS occurence
FROM search_log l
WINDOW query_window AS (PARTITION BY stemmed_query ORDER BY created DESC
RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
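As a rough check of what the explicit frame changes, the sketch below compares the default frame with the unbounded one on a cut-down search_log. It assumes sqlite3 (SQLite 3.25 or newer) instead of PostgreSQL, stores timestamps as ISO date strings, and uses invented sample rows; the aliases running and occurrences are illustrative only.

```python
import sqlite3

# Assumption: SQLite (3.25+) stands in for PostgreSQL; sample rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE search_log (id INTEGER PRIMARY KEY, stemmed_query TEXT, created TEXT)"
)
conn.executemany(
    "INSERT INTO search_log VALUES (?, ?, ?)",
    [
        (1, "shoe", "2020-01-01"),
        (2, "shoe", "2020-01-03"),
        (3, "shoe", "2020-01-02"),
        (4, "hat", "2020-01-05"),
    ],
)

full_rows = conn.execute(
    """
    SELECT id,
           -- default frame: running count in created DESC order
           COUNT(*) OVER (PARTITION BY stemmed_query ORDER BY created DESC) AS running,
           -- unbounded frame: count of the whole partition on every row
           COUNT(*) OVER (PARTITION BY stemmed_query ORDER BY created DESC
                          RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS occurrences,
           -- first row in created DESC order is the most recent one
           FIRST_VALUE(created) OVER (PARTITION BY stemmed_query ORDER BY created DESC
                          RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_seen
    FROM search_log
    ORDER BY id
    """
).fetchall()

for row in full_rows:
    print(row)  # (id, running, occurrences, last_seen)
```

With the unbounded frame, every "shoe" row reports 3 occurrences regardless of its position in the ordering, while the ordering is still available to first_value().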