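I have a large table with more than 18 million rows and I want to compute the median, for which I am using PERCENTILE_CONT. The query takes about 17 minutes, which is not ideal.
Here is my query: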
WITH raw_data AS
(
SELECT name AS series,
(duration) /(60000) AS value
FROM warehouse.table
),
quartiles AS
(
SELECT series,
value,
PERCENTILE_CONT(0.25) WITHIN GROUP(ORDER BY value) OVER (PARTITION BY series) AS q1,
MEDIAN(value) OVER (PARTITION BY series) AS median,
PERCENTILE_CONT(0.75) WITHIN GROUP(ORDER BY value) OVER (PARTITION BY series) AS q3
FROM raw_data
)
SELECT series,
MIN(value) AS minimum,
AVG(q1) AS q1,
AVG(median) AS median,
AVG(q3) AS q3,
MAX(value) AS maximum
FROM quartiles
GROUP BY 1
Is there any way to speed this up?
Thanks
Answer 0: (score: 1)
Your query asks Redshift to do a lot of work. The data has to be distributed according to your PARTITION BY column and sorted according to your ORDER BY column.
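To see how much of this distribution and sorting work a query triggers, one option is to look at its plan with EXPLAIN. The sketch below is only an illustration: it reuses the table and column names from the question (warehouse.table, name, duration) and a single window function to keep the plan short.

```sql
-- Sketch: show the plan for one of the window functions from the question.
-- warehouse.table, name and duration are taken from the question as written.
-- In the output, look for steps that redistribute rows across nodes and
-- steps that sort them; those are the costs described above.
EXPLAIN
SELECT name AS series,
       MEDIAN(duration / 60000) OVER (PARTITION BY name) AS median
FROM warehouse.table;
```

EXPLAIN only prints the plan and does not run the query, so it is cheap to try before and after changing the table layout.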
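There are two ways to make it faster:
Use the PARTITION BY column as the distribution key (DISTKEY(series)) and as the first sort key.
Use the ORDER BY column as the second sort key (SORTKEY(series, value)).
This minimizes the work needed to answer the query. The time saved will vary, but with this approach on a small test cluster I saw a PERCENTILE_CONT query drop from 3m30s to 30s.
Answer 1: (score: 0)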
To speed it up partially, try the following:
SELECT DISTINCT
name AS series,
PERCENTILE_CONT(0.25) WITHIN GROUP(ORDER BY duration / 60000) OVER (PARTITION BY name) AS q1,
MEDIAN(duration / 60000) OVER (PARTITION BY name) AS median,
PERCENTILE_CONT(0.75) WITHIN GROUP(ORDER BY duration / 60000) OVER (PARTITION BY name) AS q3
FROM warehouse.table
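This may be faster because it is more likely to make proper use of the table's sort and distribution keys. You will have to compute the minimum and maximum elsewhere, but at least check whether it runs faster.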
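One possible way to compute the minimum and maximum "elsewhere" is a plain GROUP BY over the same table, joined back to the quartiles. This is a sketch, not a measured improvement. The column names name and duration come from the question, and the CTE name extremes is made up for this example.

```sql
-- Sketch: quartiles from window functions, min/max from a plain GROUP BY,
-- joined on the series name. Table and column names follow the question.
WITH quartiles AS (
    SELECT DISTINCT
           name AS series,
           PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY duration / 60000)
               OVER (PARTITION BY name) AS q1,
           MEDIAN(duration / 60000) OVER (PARTITION BY name) AS median,
           PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY duration / 60000)
               OVER (PARTITION BY name) AS q3
    FROM warehouse.table
),
extremes AS (  -- hypothetical helper CTE, one row per series
    SELECT name AS series,
           MIN(duration / 60000) AS minimum,
           MAX(duration / 60000) AS maximum
    FROM warehouse.table
    GROUP BY name
)
SELECT q.series,
       e.minimum,
       q.q1,
       q.median,
       q.q3,
       e.maximum
FROM quartiles q
JOIN extremes e ON e.series = q.series;
```

Both CTEs return one row per series, so the join stays small even though each one scans the full table.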
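Answer 2: (score: 0)
You can try the APPROXIMATE PERCENTILE_DISC ( percentile ) function, which is optimized for distributed data and has a low error rate. This includes the median, using a percentile of 0.5.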
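A minimal sketch of how this might look for the median in the question's query, assuming the same table and columns (warehouse.table, name, duration). APPROXIMATE PERCENTILE_DISC is an aggregate rather than a window function, so it is used with GROUP BY and no OVER clause.

```sql
-- Sketch: approximate median per series, alongside exact min and max.
-- The result is approximate and, being PERCENTILE_DISC, always returns a
-- value that exists in the data rather than an interpolated one.
SELECT name AS series,
       MIN(duration / 60000) AS minimum,
       APPROXIMATE PERCENTILE_DISC(0.5)
           WITHIN GROUP (ORDER BY duration / 60000) AS median,
       MAX(duration / 60000) AS maximum
FROM warehouse.table
GROUP BY name;
```

Because the result is approximate and discrete, it can differ slightly from what MEDIAN or PERCENTILE_CONT(0.5) returns, so it is worth comparing the two on a sample before switching.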