I'm new to the Spark environment. My dataset includes the columns user_id and order_quantity.
I want to calculate the 90th percentile of order_quantity for each user_id.
If this were plain SQL, I would use the following query:
%sql
SELECT user_id,
       PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY order_quantity) OVER (PARTITION BY user_id)
However, Spark does not have built-in support for the PERCENTILE_CONT function.
Any suggestions on how to achieve this for the dataset described above? Let me know if you need more information.
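(Editorial note, not from the original post: per-group percentiles can also be computed without a window function, because Spark SQL ships the plain aggregates percentile(), which returns an exact, interpolated value much like PERCENTILE_CONT, and percentile_approx(), which trades accuracy for speed. A minimal sketch, assuming a hypothetical table named orders holding the two columns above:)
%sql
-- `orders` is a hypothetical stand-in for the asker's dataset.
SELECT
  user_id,
  percentile(order_quantity, 0.9)        AS p90_exact,   -- exact, interpolated
  percentile_approx(order_quantity, 0.9) AS p90_approx   -- approximate, cheaper on large data
FROM orders
GROUP BY user_id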
Answer 0 (score: 1)
I have a solution for PERCENTILE_DISC(0.9), which returns the discrete order_quantity closest to the 0.9 percentile (no interpolation).
The idea is to compute PERCENT_RANK, subtract 0.9, take the absolute value, and then take the minimum:
%sql
WITH temp1 AS (
  SELECT
    user_id,
    order_quantity,
    ABS(PERCENT_RANK() OVER
      (PARTITION BY user_id ORDER BY order_quantity) - 0.9) AS perc_90_temp
  FROM my_table  -- placeholder for your source table
)
SELECT DISTINCT
  user_id,
  FIRST_VALUE(order_quantity) OVER
    (PARTITION BY user_id ORDER BY perc_90_temp) AS perc_disc_90
FROM temp1;
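(Editorial note: another way to express the same discrete idea, sketched below against the same hypothetical `orders` table, is to keep only the rows whose PERCENT_RANK is at or below 0.9 and take the largest order_quantity per user. This also avoids interpolation, though it returns the largest value not exceeding the 0.9 rank rather than the value nearest to it.)
%sql
-- Alternative discrete-percentile sketch (hypothetical `orders` table):
-- largest order_quantity whose percent rank within the user does not exceed 0.9.
WITH ranked AS (
  SELECT
    user_id,
    order_quantity,
    PERCENT_RANK() OVER (PARTITION BY user_id ORDER BY order_quantity) AS pr
  FROM orders
)
SELECT
  user_id,
  MAX(order_quantity) AS perc_disc_90
FROM ranked
WHERE pr <= 0.9
GROUP BY user_id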
Answer 1 (score: 0)
I was dealing with a similar problem. I was working in SAP HANA and then moved to Spark SQL on Databricks. I migrated the following SAP HANA query:
SELECT DISTINCT
  ITEM_ID,
  LOCATION_ID,
  PERCENTILE_CONT(0.8) WITHIN GROUP (ORDER BY VENTAS) OVER (PARTITION BY ITEM_ID, LOCATION_ID) AS P95Y,
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY PRECIO) OVER (PARTITION BY ITEM_ID, LOCATION_ID) AS MEDIAN_PRECIO
FROM MY_TABLE
to
SELECT DISTINCT
  ITEM_ID,
  LOCATION_ID,
  PERCENTILE(VENTAS, 0.8) OVER (PARTITION BY ITEM_ID, LOCATION_ID) AS P95Y,
  PERCENTILE(PRECIO, 0.5) OVER (PARTITION BY ITEM_ID, LOCATION_ID) AS MEDIAN_PRECIO
FROM delta.`MY_TABLE`
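(Editorial note: since percentile() is an ordinary aggregate in Spark SQL, the same migrated query can arguably also be written with GROUP BY instead of DISTINCT plus a window, which avoids computing the same value once per row and then deduplicating. A sketch under the same table and column assumptions:)
-- GROUP BY form of the migrated query (same Delta table and columns).
SELECT
  ITEM_ID,
  LOCATION_ID,
  percentile(VENTAS, 0.8) AS P95Y,
  percentile(PRECIO, 0.5) AS MEDIAN_PRECIO
FROM delta.`MY_TABLE`
GROUP BY ITEM_ID, LOCATION_ID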
In your particular case, it should look like this:
SELECT DISTINCT
  user_id,
  PERCENTILE(order_quantity, 0.9) OVER (PARTITION BY user_id) AS perc_90
FROM my_table  -- placeholder for your source table
I hope this helps.