I have a dataset, call it d1, that contains the following information:
ID count
1 5
2 2
3 6
4 6
5 4
6 3
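If I want the median, it should be computed over [1,1,1,1,1,2,2,...,6,6,6], because each ID is repeated count times. The result would be 3.5 (the two middle values are 3 and 4, and we average them). I have been trying to use LIMIT in a subquery, but I could not get it to work, so I don't know how to obtain the middle value, or the average of the two middle values when the number of elements is even.
How can I do this in SQL?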
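Answer 0 (score: 5)
You can use generate_series from 1 to count to expand each row of the dataset, then apply the percentile_cont ordered-set aggregate function. This works in PostgreSQL 9.4+.
Self-contained example: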
WITH x(id, cnt) as (
values
(1, 5),
(2, 2),
(3, 6),
(4, 6),
(5, 4),
(6, 3)
)
SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY id) med
FROM x, generate_series(1,cnt)
-- outputs:
--  med
--  3.5
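As a sanity check outside the database, the same expansion can be sketched in Python. This is only an illustration: the function name expanded_median is made up here, and the data is copied from the question. For an even number of elements, statistics.median averages the two middle values, which matches what percentile_cont(0.5) gives for this data.

```python
from statistics import median

# (id, cnt) pairs copied from the question's d1 table
rows = [(1, 5), (2, 2), (3, 6), (4, 6), (5, 4), (6, 3)]


def expanded_median(rows):
    """Repeat each id cnt times, then take the ordinary median.

    Hypothetical helper mirroring generate_series(1, cnt) + percentile_cont(0.5).
    """
    expanded = [id_ for id_, cnt in rows for _ in range(cnt)]
    return median(expanded)


print(expanded_median(rows))  # 3.5
```

Like the generate_series query, this materializes one element per unit of count, so it is best suited to small counts.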
Another option is to use window functions to determine the positions of the elements that should be averaged to get the median:
WITH x(id,"cnt") as (
values
(1,5),
(2,2),
(3,6),
(4,6),
(5,4),
(6,3)
)
, windowed AS (
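SELECT id,
       SUM(cnt) OVER w AS a,               -- cumulative count up to and including this row
       COALESCE(SUM(cnt) OVER u, 0) AS b,  -- cumulative count before this row; the frame is empty (NULL) on the first row
       SUM(cnt) OVER v / 2.0 AS c          -- half of the total count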
FROM x
WINDOW u AS (ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING),
v AS (ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING),
w AS (ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
)
SELECT AVG(id) med
FROM windowed
WHERE c BETWEEN b AND a
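To see why the filter c BETWEEN b AND a selects the right rows: in the expanded list, each id occupies positions b+1 through a, so a row is kept exactly when half the total count falls within (or on the edge of) its range. A minimal Python sketch of that logic, without expanding the data, might look like this. The name weighted_median is hypothetical, the count before the first row is taken as 0, and the sketch assumes a non-empty input with positive counts.

```python
from itertools import accumulate

# (id, cnt) pairs copied from the question's d1 table
rows = [(1, 5), (2, 2), (3, 6), (4, 6), (5, 4), (6, 3)]


def weighted_median(rows):
    """Median of ids weighted by cnt, using running totals only.

    Assumes rows is non-empty and every cnt is a positive integer.
    """
    rows = sorted(rows)                             # ORDER BY id
    running = list(accumulate(cnt for _, cnt in rows))  # "a" in the query
    half = running[-1] / 2                          # "c" in the query
    # "b" is the running total before the row, i.e. a - cnt (0 for the first row)
    picked = [id_ for (id_, cnt), a in zip(rows, running)
              if a - cnt <= half <= a]
    return sum(picked) / len(picked)                # AVG(id)


print(weighted_median(rows))  # 3.5
```

With a total of 26, half is 13: id 3 ends at position 13 and id 4 starts at position 14, so both are kept and averaged. With an odd total, half is fractional and only the single row containing the middle position qualifies.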
Answer 1 (score: 1)
I find this a relatively simple approach:
select avg(id)
from (select x.*,
sum(cnt) over (order by id) as running_cnt,
sum(cnt) over () as total_cnt
from x
) x
where running_cnt >= total_cnt / 2.0 and
running_cnt - cnt <= total_cnt / 2.0;
Here is a db<>fiddle.