I want to get the 5th, 50th, and 95th percentiles of a column in a table
SELECT col1, col2, col3, AVG(col4), STD(col4),
PERCENTILE_CONT(0.05) WITHIN GROUP (ORDER BY col4)
OVER (PARTITION BY col1, col2, col3) as 5th_percentile,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY col4)
OVER (PARTITION BY col1, col2, col3) as 50th_percentile,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY col4)
OVER (PARTITION BY col1, col2, col3) as 95th_percentile
FROM table
GROUP BY col1, col2, col3
LIMIT 100
What I end up with is 5th_percentile == 50th_percentile == 95th_percentile:
AVG(col4)    STD(col4)   5th_percentile  50th_percentile  95th_percentile
300.000000   0.000000    300.000000      300.000000       300.000000
67.076600    16.968851   82.031792       82.031792        82.031792
66.166136    11.452172   78.348846       78.348846        78.348846
544.262809   68.269014   605.797302      605.797302       605.797302
22.523138    1.820358    24.000000       24.000000        24.000000
What is going on?
Edit: the database is MemSQL
Answer 0: (score: 2)
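Window functions run after the GROUP BY clause. GROUP BY produces one row per group, so each window partition contains only that single row, which is why the PERCENTILE_CONT window functions all return the same value.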
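To make the "one row per group" point concrete, here is a minimal Python sketch of the linear interpolation that PERCENTILE_CONT is generally defined to perform. The helper name `percentile_cont` and the multi-row sample values are made up for illustration; only 82.031792 comes from the output in the question. This is not MemSQL's implementation, just the usual definition.

```python
import math


def percentile_cont(values, fraction):
    """Linearly interpolated percentile (the usual PERCENTILE_CONT definition)."""
    ordered = sorted(values)
    if not ordered:
        return None  # SQL would return NULL for an empty group
    # Fractional 0-based position of the requested percentile.
    position = fraction * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    weight = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * weight


# A partition that has already been collapsed to one row by GROUP BY.
one_row = [82.031792]
# Hypothetical ungrouped rows for the same group (illustrative values only).
many_rows = [40.0, 55.0, 67.0, 82.031792, 90.0]

for fraction in (0.05, 0.5, 0.95):
    print(fraction,
          percentile_cont(one_row, fraction),
          percentile_cont(many_rows, fraction))
```

With a single row the position is always 0, so every percentile is that row's value. That matches the identical columns in the question.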
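You want to compute the window functions first and only then do the GROUP BY. You can do this by putting the window functions in an inner subselect and the GROUP BY in the outer select.
Here is the postgres documentation that explains how window functions relate to GROUP BY (this is standard ANSI SQL, and MemSQL behaves the same way):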
https://www.postgresql.org/docs/current/static/tutorial-window.html
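The rows considered by a window function are those of the "virtual table" produced by the query's FROM clause as filtered by its WHERE, GROUP BY, and HAVING clauses if any. For example, a row removed because it does not meet the WHERE condition is not seen by any window function. A query can contain multiple window functions that slice up the data in different ways using different OVER clauses, but they all act on the same collection of rows defined by this virtual table.
Note that in MemSQL, if you use a column that is neither grouped nor aggregated (such as col4 in your query), an arbitrary value is taken from the rows in the group, i.e. it behaves like an ANY_VALUE aggregate. In a future version of MemSQL, this query will return an error, to help you avoid writing queries with this kind of unexpected behavior.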
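The ordering described above can be checked on a small example. The sketch below uses Python's built-in sqlite3 module rather than MemSQL, and it uses COUNT(*) OVER as a stand-in for PERCENTILE_CONT, which SQLite does not provide by default. The table `t`, its columns, and the sample rows are hypothetical, and the SQLite library needs window function support (version 3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (grp TEXT, val REAL)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("a", 1.0), ("a", 2.0), ("a", 3.0), ("b", 10.0), ("b", 20.0)],
)

# Window function in the same SELECT as GROUP BY: it runs after grouping,
# so each partition holds only the single row produced for that group.
after_group_by = conn.execute(
    """
    SELECT grp,
           COUNT(*) AS rows_in_group,
           COUNT(*) OVER (PARTITION BY grp) AS rows_seen_by_window
    FROM t
    GROUP BY grp
    ORDER BY grp
    """
).fetchall()

# Window function in an inner subselect, GROUP BY in the outer select:
# the window now sees every original row of the partition. MAX() only
# collapses a value that is already constant within each group.
inner_then_group = conn.execute(
    """
    SELECT grp, MAX(rows_seen_by_window) AS rows_seen_by_window
    FROM (
        SELECT grp, COUNT(*) OVER (PARTITION BY grp) AS rows_seen_by_window
        FROM t
    ) AS sub
    GROUP BY grp
    ORDER BY grp
    """
).fetchall()

print(after_group_by)
print(inner_then_group)
```

The same shape should carry over to the original query: compute the three PERCENTILE_CONT ... OVER (PARTITION BY col1, col2, col3) columns in the subselect, then apply AVG, STD, and GROUP BY in the outer select.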
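Answer 1: (score: 0)
PERCENTILE_CONT() can be either an aggregate function or a window function, at least in some databases.
I think what is happening is that the values are being computed after the aggregation, although I am not sure why. To be honest, I would have expected the code to fail with a syntax error, because col4 is not aggregated. In other words, (ORDER BY MAX(col4)) should work but (ORDER BY col4) should not, because the percentiles are calculated after the aggregation.
But try it without the OVER clause: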
SELECT col1, col2, col3, AVG(col4), STD(col4),
PERCENTILE_CONT(0.05) WITHIN GROUP (ORDER BY col4) as 5th_percentile,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY col4) as 50th_percentile,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY col4) as 95th_percentile
FROM table
GROUP BY col1, col2, col3
LIMIT 100;
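Edit:
It seems that your database does not support PERCENTILE_CONT() as an aggregate function. There is no accounting for taste; most databases do.
A workaround is SELECT DISTINCT: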
SELECT DISTINCT col1, col2, col3,
AVG(col4) OVER (PARTITION BY col1, col2, col3),
STD(col4) OVER (PARTITION BY col1, col2, col3),
PERCENTILE_CONT(0.05) WITHIN GROUP (ORDER BY col4) OVER (PARTITION BY col1, col2, col3) as 5th_percentile,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY col4) OVER (PARTITION BY col1, col2, col3) as 50th_percentile,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY col4) OVER (PARTITION BY col1, col2, col3) as 95th_percentile
FROM table
LIMIT 100;
Or use a subquery.
Answer 2: (score: 0)
WITH a AS (
SELECT col1, col2, col3,
PERCENTILE_CONT(0.05) WITHIN GROUP (ORDER BY col4)
OVER (PARTITION BY col1, col2, col3) as 5th_percentile,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY col4)
OVER (PARTITION BY col1, col2, col3) as 50th_percentile,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY col4)
OVER (PARTITION BY col1, col2, col3) as 95th_percentile
FROM table
)
SELECT DISTINCT col1, col2, col3, 5th_percentile, 50th_percentile, 95th_percentile
FROM a
LIMIT 100
This works. It looks like you cannot combine percentile_cont with a GROUP BY.