Presto equivalent of Redshift's PERCENTILE_DISC

Time: 2018-02-13 06:54:39

Tags: mysql amazon-redshift presto amazon-redshift-spectrum

Given the following query in Redshift:

select 
distinct cast(joinstart_ev_timestamp as date) as session_date, 
PERCENTILE_DISC(0.02) WITHIN GROUP (ORDER BY join_time) over(partition by 
trunc(joinstart_ev_timestamp))/1000 as mini,
median(join_time) over(partition by trunc(joinstart_ev_timestamp))/1000 as jt,
product_name as product,
endpoint as endpoint
from qe_datawarehouse.join_session_fact
where  
cast(joinstart_ev_timestamp as date)  between date '2018-01-18' and date '2018-01-30'
and lower(product_name) LIKE 'gotoTest%' 
and join_time > 0 and join_time <= 600000 and join_time is not null 
and audio_connect_time >= 0 
and (entrypoint_access_time >= 0 or entrypoint_access_time is null)
and (panel_connect_time >= 0  or panel_connect_time is null) and version = 'V2'

I need to convert the above query to the corresponding Presto syntax. The Presto query I wrote is:

select 
distinct cast(joinstart_ev_timestamp as date) as session_date, 
PERCENTILE_DISC( WITHIN GROUP (ORDER BY cast(join_time as double)) 
over(partition by cast(joinstart_ev_timestamp as date) )/1000 as mini,
approx_percentile(cast(join_time as double),0.50) over (partition by 
cast(joinstart_ev_timestamp as date)) /1000 as jt,
product_name as product,
endpoint as endpoint
from datawarehouse.join_session_fact
where  
cast(joinstart_ev_timestamp as date)  between date '2018-01-18' and date '2018-01-30'
and lower(product_name) LIKE 'gotoTest%' 
and join_time > 0 and join_time <= 600000 and join_time is not null 
and audio_connect_time >= 0 
and (entrypoint_access_time >= 0 or entrypoint_access_time is null)
and (panel_connect_time >= 0  or panel_connect_time is null) and version = 'V2'

Everything here works except this line, which raises an error:

PERCENTILE_DISC( WITHIN GROUP (ORDER BY cast(join_time as double)) 
    over(partition by cast(joinstart_ev_timestamp as date) )/1000 as mini,

What is the corresponding Presto syntax for it?

2 answers:

Answer 0: (score: 0)

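If Presto supported nested window functions, you could use NTH_VALUE together with p * COUNT(*) OVER (PARTITION BY ...) to find the offset corresponding to the "p'th" percentile in the window. Since Presto does not support this, you need to join to a subquery that counts the records in each window instead: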

SELECT
  my_table.window_column,
  /* Replace :p with the desired percentile (in your case, 0.02) */
  /* NTH_VALUE takes the value expression first, then the 1-based offset */
  NTH_VALUE(my_table.ordered_column, :p*subquery.records_in_window)
    OVER (PARTITION BY my_table.window_column ORDER BY my_table.ordered_column
          ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
FROM my_table
JOIN (
  SELECT
    window_column,
    COUNT(*) AS records_in_window
  FROM my_table
  GROUP BY window_column
) subquery ON subquery.window_column = my_table.window_column

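The above is conceptually close but fails, because :p*subquery.records_in_window is a floating-point number and the offset needs to be an integer. There are a few ways to handle this. For example, if you are looking for the median, simply rounding to the nearest integer works. If you are looking for the 2nd percentile, rounding will not work, because it will often give you 0 and the offset starts at 1. In that case, taking the ceiling (rounding up to the next integer) is probably better.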
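To make the offset arithmetic concrete, here is a minimal Python sketch of the nearest-rank rule described above. It assumes the goal is the smallest value whose cumulative share of the sorted rows is at least p. The function name percentile_disc and the sample join_times values are made up for illustration and are not part of Presto or Redshift.

```python
import math

def percentile_disc(values, p):
    """Nearest-rank discrete percentile (illustrative helper, not a Presto API).

    Returns the smallest value whose cumulative share of the sorted
    rows is at least p, for 0 <= p <= 1.
    """
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    n = len(ordered)
    # round(..., 9) guards against float noise in p * n (a product that
    # should be a whole number can land just above it).
    # max(1, ...) keeps the 1-based offset valid when p is 0.
    offset = max(1, math.ceil(round(p * n, 9)))
    return ordered[offset - 1]

join_times = [30, 10, 50, 20, 40]  # made-up sample values

# Plain rounding yields 0 here, which is not a valid 1-based offset.
naive_offset = round(0.02 * len(join_times))

p02 = percentile_disc(join_times, 0.02)  # ceiling gives offset 1
p50 = percentile_disc(join_times, 0.5)   # ceiling gives offset 3
```

In the SQL above, the same idea would amount to wrapping the product in a ceiling and casting the result to an integer type before passing it as the offset.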

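Answer 1: (score: 0)

I was looking for a way to compute the median in Presto and found a solution that works for me:

For example, I have a join table A_join_B with columns A_id and B_id.

I want to find the median of the number of A's related to a single B.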

SELECT APPROX_PERCENTILE(count, 0.5) FROM ( SELECT COUNT(*) AS count, B_id FROM A_join_B GROUP BY B_id );
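For readers who want to check the logic on a small data set, the following Python sketch mirrors the two steps of that query: count the rows per B_id, then take the median of the counts. The sample rows are invented, and median_low is an exact stand-in chosen for illustration, whereas APPROX_PERCENTILE returns an approximation.

```python
from collections import Counter
from statistics import median_low

# Made-up contents of A_join_B as (A_id, B_id) pairs.
rows = [(1, "x"), (2, "x"), (3, "x"), (4, "y"), (5, "z"), (6, "z")]

# Inner query: COUNT(*) ... GROUP BY B_id
counts = Counter(b_id for _a_id, b_id in rows)

# Outer query: median of the per-B counts (exact, lower middle value)
median_count = median_low(counts.values())
```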