Question

Given the following query in Redshift:

select 
distinct cast(joinstart_ev_timestamp as date) as session_date, 
PERCENTILE_DISC(0.02) WITHIN GROUP (ORDER BY join_time) over(partition by 
trunc(joinstart_ev_timestamp))/1000 as mini,
median(join_time) over(partition by trunc(joinstart_ev_timestamp))/1000 as jt,
product_name as product,
endpoint as endpoint
from qe_datawarehouse.join_session_fact
where  
cast(joinstart_ev_timestamp as date)  between date '2018-01-18' and date '2018-01-30'
and lower(product_name) LIKE 'gotoTest%' 
and join_time > 0 and join_time <= 600000 and join_time is not null 
and audio_connect_time >= 0 
and (entrypoint_access_time >= 0 or entrypoint_access_time is null)
and (panel_connect_time >= 0  or panel_connect_time is null) and version = 'V2'

I need to convert the query above to the corresponding Presto syntax. The Presto query I wrote is:

select 
distinct cast(joinstart_ev_timestamp as date) as session_date, 
PERCENTILE_DISC( WITHIN GROUP (ORDER BY cast(join_time as double)) 
over(partition by cast(joinstart_ev_timestamp as date) )/1000 as mini,
approx_percentile(cast(join_time as double),0.50) over (partition by 
cast(joinstart_ev_timestamp as date)) /1000 as jt,
product_name as product,
endpoint as endpoint
from datawarehouse.join_session_fact
where  
cast(joinstart_ev_timestamp as date)  between date '2018-01-18' and date '2018-01-30'
and lower(product_name) LIKE 'gotoTest%' 
and join_time > 0 and join_time <= 600000 and join_time is not null 
and audio_connect_time >= 0 
and (entrypoint_access_time >= 0 or entrypoint_access_time is null)
and (panel_connect_time >= 0  or panel_connect_time is null) and version = 'V2'

Everything here works except for these lines, which raise an error:

PERCENTILE_DISC( WITHIN GROUP (ORDER BY cast(join_time as double)) 
    over(partition by cast(joinstart_ev_timestamp as date) )/1000 as mini,

What is the corresponding Presto syntax?

Answer 1

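If Presto supported nested window functions, you could use NTH_VALUE together with p * COUNT(*) OVER (PARTITION BY ...) to find the offset corresponding to the p'th percentile in the window. Since Presto does not support this, you need to join to a subquery that computes the number of records in each window instead: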

SELECT
  my_table.window_column,
  /* Replace :p with the desired percentile (in your case, 0.02) */
  NTH_VALUE(my_table.ordered_column, :p*subquery.records_in_window)
    OVER (PARTITION BY my_table.window_column ORDER BY my_table.ordered_column ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
FROM my_table
JOIN (
  SELECT
    window_column,
    COUNT(*) AS records_in_window
  FROM my_table
  GROUP BY window_column
) subquery ON subquery.window_column = my_table.window_column

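The above is conceptually close but fails, because :p*subquery.records_in_window is a floating-point number and the offset must be an integer. There are a few ways to deal with this. If you are looking for the median, simply rounding to the nearest integer works. If you are looking for the 2nd percentile, rounding does not work, because it will often give you 0 and offsets start at 1. In that case, taking the ceiling (rounding up to the next integer) is probably the better choice.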
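As a rough sketch of the ceiling approach, the query could look like the following. The names my_table, window_column and ordered_column are the same placeholders as above, 0.02 is the percentile from the question, and the alias p02 is made up for illustration. The fraction is written as a DECIMAL literal on the assumption that exact decimal arithmetic avoids floating-point results landing just above a whole number. Note that COUNT(*) also counts rows where ordered_column is NULL, so you may want to filter those out first.

```sql
SELECT DISTINCT
  t.window_column,
  -- CEIL(p * n) is the 1-based row that a discrete percentile picks;
  -- the cast turns it into the integer offset NTH_VALUE expects.
  NTH_VALUE(
    t.ordered_column,
    CAST(CEIL(DECIMAL '0.02' * c.records_in_window) AS BIGINT)
  ) OVER (
    PARTITION BY t.window_column
    ORDER BY t.ordered_column
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) AS p02  -- hypothetical alias
FROM my_table t
JOIN (
  -- number of rows per window, computed separately because
  -- window functions cannot be nested
  SELECT window_column, COUNT(*) AS records_in_window
  FROM my_table
  GROUP BY window_column
) c ON c.window_column = t.window_column
```

For any p greater than 0 and a non-empty window, the ceiling is at least 1, so the offset never falls to 0.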

Answer 2

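I was working on medians in Presto and found a solution that worked for me:

For example, I have a join table A_join_B with columns A_id and B_id.

I want to find the median of the number of A's associated with a single B.

SELECT APPROX_PERCENTILE(count, 0.5) FROM (SELECT COUNT(*) AS count, B_id FROM A_join_B GROUP BY B_id);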
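Applied to the table from the question, the same function might be used as a plain aggregate, roughly as sketched below. This is only an illustration: approx_percentile returns an approximation, so the values may differ slightly from Redshift's exact PERCENTILE_DISC. The product_name and endpoint columns and the remaining filters from the question are left out for brevity.

```sql
SELECT
  CAST(joinstart_ev_timestamp AS date) AS session_date,
  -- approximate 2nd percentile and median of join_time, in seconds
  approx_percentile(CAST(join_time AS double), 0.02) / 1000 AS mini,
  approx_percentile(CAST(join_time AS double), 0.5) / 1000 AS jt
FROM datawarehouse.join_session_fact
WHERE CAST(joinstart_ev_timestamp AS date)
      BETWEEN date '2018-01-18' AND date '2018-01-30'
  AND join_time > 0 AND join_time <= 600000
GROUP BY CAST(joinstart_ev_timestamp AS date)
```

Grouping by the date replaces the DISTINCT plus window-function combination in the question's query, since each date then yields exactly one row.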

Presto equivalent of Redshift's PERCENTILE_DISC
