Equivalent of the percentile_cont function in Apache Spark SQL

Asked: 2016-11-10 00:09:33

Tags: apache-spark apache-spark-sql spark-dataframe

I am new to the Spark environment. My dataset has the following columns:

user_id,Date_time,order_quantity

I want to calculate the 90th percentile of order_quantity for each user_id.

If it were SQL, I would use the following query:

%sql
SELECT user_id,
       PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY order_quantity)
         OVER (PARTITION BY user_id) AS perc_90
FROM my_table  -- table name assumed

However, Spark has no built-in support for the percentile_cont function.

Any suggestions on how to implement this on the dataset above? Please let me know if more information is needed.

2 Answers:

Answer 0 (Score: 1)

I have a solution for PERCENTILE_DISC(0.9), which returns the discrete order_quantity closest to the 0.9 percentile (no interpolation). The idea is to compute PERCENT_RANK, subtract 0.9, take the absolute value, and then pick the minimum:

%sql
WITH temp1 AS (
  SELECT
    user_id,
    order_quantity,
    ABS(PERCENT_RANK() OVER (PARTITION BY user_id ORDER BY order_quantity) - 0.9) AS perc_90_temp
  FROM my_table  -- table name assumed; not given in the original
)
SELECT DISTINCT
  user_id,
  FIRST_VALUE(order_quantity) OVER (PARTITION BY user_id ORDER BY perc_90_temp) AS perc_disc_90
FROM temp1;
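If an approximate answer is acceptable, Spark SQL's built-in percentile_approx aggregate expresses the same discrete percentile without window functions; a minimal sketch, assuming the data sits in a table called my_table:

%sql
-- percentile_approx returns an order_quantity value at (approximately) the
-- 0.9 percentile per user; my_table is a placeholder for the actual table
SELECT user_id,
       percentile_approx(order_quantity, 0.9) AS perc_disc_90
FROM my_table
GROUP BY user_id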

Answer 1 (Score: 0)

I was dealing with a similar problem. I had been working in SAP HANA and then moved to Spark SQL on Databricks. I migrated the following SAP HANA query:

SELECT 
    DISTINCT ITEM_ID, 
    LOCATION_ID, 
    PERCENTILE_CONT(0.8) WITHIN GROUP (ORDER BY VENTAS) OVER (PARTITION BY ITEM_ID, LOCATION_ID) AS P95Y, 
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY PRECIO) OVER (PARTITION BY ITEM_ID, LOCATION_ID) AS MEDIAN_PRECIO 
FROM MY_TABLE

I rewrote it as the following Spark SQL query:

SELECT DISTINCT
  ITEM_ID,
  LOCATION_ID,
  PERCENTILE(VENTAS,0.8) OVER (PARTITION BY ITEM_ID, LOCATION_ID) AS P95Y,
  PERCENTILE(PRECIO,0.5) OVER (PARTITION BY ITEM_ID, LOCATION_ID) AS MEDIAN_PRECIO
FROM
    delta.`MY_TABLE`
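One point worth noting about this translation: Spark's PERCENTILE aggregate computes an exact percentile with linear interpolation, so it matches PERCENTILE_CONT semantics; the syntactic difference is that the fraction is passed as a second argument instead of through a WITHIN GROUP (ORDER BY ...) clause.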

In your particular case, it should look like this:

SELECT DISTINCT
  user_id,
  PERCENTILE(order_quantity, 0.9) OVER (PARTITION BY user_id) AS perc_90
FROM my_table  -- table name assumed; not given in the original
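To see the interpolation at work, here is a self-contained sketch over a made-up inline table (five orders for a single user):

-- For the values 10..50, the 0.9 percentile position is 0.9 * (5 - 1) = 3.6,
-- which falls between 40 and 50, so PERCENTILE returns the interpolated
-- value 46.0 rather than an actual data point
SELECT DISTINCT
  user_id,
  PERCENTILE(order_quantity, 0.9) OVER (PARTITION BY user_id) AS perc_90
FROM VALUES (1, 10), (1, 20), (1, 30), (1, 40), (1, 50) AS t(user_id, order_quantity)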

I hope this helps.