Using SQL

Time: 2017-03-05 11:30:04

Tags: sql apache-spark-sql

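My dataset has 3 columns:

  1. Monetary
  2. Recency
  3. Frequency

I want to create 3 more columns, M_Q, R_Q and F_Q, that hold the percentile value of each Monetary, Recency and Frequency value, computed using SQL.

Thanks in advance.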

    Customer_ID  Frequency  Recency  Monetary  R_Q    F_Q     M_Q
    112          1          39       7.05      0.398  0.789   0.85873
    143          1          23       0.1833    0.232  0.7895  0.1501
    164          1          52       0.416     0.508  0.789   0.295
    123          1          118      1.1       0.98   0.789   0.52
    

1 answer:

Answer 0 (score: 0)

The function you are looking for is the ANSI standard function ntile():

select t.*,
       ntile(100) over (order by monetary) as percentile_monetary,
       ntile(100) over (order by recency) as percentile_recency,
       ntile(100) over (order by frequency) as percentile_frequency
from t;

This is available in most databases.
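To try the ntile() query without a Spark cluster, here is a minimal sketch that runs it on the four sample rows from the question. It assumes Python's built-in sqlite3 module with a bundled SQLite of 3.25 or newer (needed for window functions) as a stand-in database; the helper name `ntile_buckets` is hypothetical, and the table name `t` follows the query above.

```python
import sqlite3  # assumption: the bundled SQLite is 3.25+ (window function support)

# Sample rows copied from the question: (customer_id, frequency, recency, monetary)
ROWS = [
    (112, 1, 39, 7.05),
    (143, 1, 23, 0.1833),
    (164, 1, 52, 0.416),
    (123, 1, 118, 1.1),
]


def ntile_buckets(rows, buckets=100):
    """Return {customer_id: (monetary_bucket, recency_bucket)} using ntile()."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table t (customer_id integer, frequency integer,"
        " recency integer, monetary real)"
    )
    conn.executemany("insert into t values (?, ?, ?, ?)", rows)
    n = int(buckets)  # inlined as a literal because ntile() takes a constant
    query = f"""
        select customer_id,
               ntile({n}) over (order by monetary) as percentile_monetary,
               ntile({n}) over (order by recency)  as percentile_recency
        from t
    """
    result = {cid: (m, r) for cid, m, r in conn.execute(query)}
    conn.close()
    return result


ntile_result = ntile_buckets(ROWS)
for customer_id in sorted(ntile_result):
    print(customer_id, ntile_result[customer_id])
```

With only four rows, ntile(100) can fill just buckets 1 to 4, so the output looks like a percentile only once the table holds at least around 100 rows. Frequency is left out of the sketch because all four sample values are tied, and ntile() splits tied rows across buckets in an unspecified order.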

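You can also calculate the percentile using rank() and count(). Depending on how you want to handle ties and whether you want values from 1-100 or 0-100, the following should be a good starting point: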

select t.*,
       -- each rank is zero-based, so every result starts at 1 and stays below 101
       (1 + rank_monetary * 100.0 / cnt) as percentile_monetary,
       (1 + rank_recency * 100.0 / cnt) as percentile_recency,
       (1 + rank_frequency * 100.0 / cnt) as percentile_frequency
from (select t.*,
             count(*) over () as cnt,  -- total number of rows
             -- rank() gives tied rows the same rank; subtracting 1 makes it zero-based
             rank() over (order by monetary) - 1 as rank_monetary,
             rank() over (order by recency) - 1 as rank_recency,
             rank() over (order by frequency) - 1 as rank_frequency
      from t
     ) t;
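
To see how the rank()/count() version treats ties, the sketch below runs it against the question's sample rows. As before, it assumes Python's built-in sqlite3 module (SQLite 3.25+) as a stand-in database; the names `RANK_QUERY` and `rank_percentiles` are hypothetical, and only the monetary and frequency columns are kept for brevity.

```python
import sqlite3  # assumption: the bundled SQLite is 3.25+ (window function support)

# Sample rows copied from the question: (customer_id, frequency, recency, monetary)
SAMPLE_ROWS = [
    (112, 1, 39, 7.05),
    (143, 1, 23, 0.1833),
    (164, 1, 52, 0.416),
    (123, 1, 118, 1.1),
]

# Same shape as the query above, trimmed to two of the three measures.
RANK_QUERY = """
    select customer_id,
           (1 + rank_monetary * 100.0 / cnt)  as percentile_monetary,
           (1 + rank_frequency * 100.0 / cnt) as percentile_frequency
    from (select t.*,
                 count(*) over () as cnt,
                 rank() over (order by monetary) - 1  as rank_monetary,
                 rank() over (order by frequency) - 1 as rank_frequency
          from t
         ) t
"""


def rank_percentiles(rows):
    """Return {customer_id: (percentile_monetary, percentile_frequency)}."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table t (customer_id integer, frequency integer,"
        " recency integer, monetary real)"
    )
    conn.executemany("insert into t values (?, ?, ?, ?)", rows)
    result = {cid: (m, f) for cid, m, f in conn.execute(RANK_QUERY)}
    conn.close()
    return result


rank_result = rank_percentiles(SAMPLE_ROWS)
for customer_id in sorted(rank_result):
    print(customer_id, rank_result[customer_id])
```

On this sample the monetary values come out as 1, 26, 51 and 76, while every frequency value is 1 because all four rows are tied at the lowest rank. If tied rows should instead land at the top of their group, or if a 0-100 scale is wanted, the rank expression and the leading `1 +` are the parts to adjust.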