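Calculating weekly rolling spend in Hive using window functions

Time: 2019-06-14 19:32:26

Tags: hadoop hive window-functions partition

I need to compute each customer's rolling one-week spend. Every time a customer makes a purchase, I want to know how much they have spent with us over the past week. I would like to do this in Hive.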

My dataset looks similar to this:

Spend_Table

Cust_ID | Purch_Date | Purch_Amount  
1 | 1/1/19 | $10  
1 | 1/2/19 | $21  
1 | 1/3/19 | $30  
1 | 1/4/19 | $11  
1 | 1/5/19 | $21  
1 | 1/6/19 | $31  
1 | 1/7/19 | $41  
2 | 1/1/19 | $12  
2 | 1/2/19 | $22  
2 | 1/3/19 | $32  
2 | 1/5/19 | $42  
2 | 1/7/19 | $52  
2 | 1/9/19 | $62  
2 | 1/11/19 | $72  

So far I have tried code similar to the following:

SELECT Cust_ID, Purch_Date, Purch_Amount,
       SUM(Purch_Amount) OVER (
           PARTITION BY Cust_ID
           ORDER BY unix_timestamp(Purch_Date)
           RANGE BETWEEN 604800 PRECEDING AND CURRENT ROW
       ) AS Rolling_Spend
FROM Spend_Table

Cust_ID | Purch_Date | Purch_Amount | Rolling_Spend  
1 | 1/1/19 | $10 | $10  
1 | 1/2/19 | $21 | $31  
1 | 1/3/19 | $30 | $61  
1 | 1/4/19 | $11 | $72  
1 | 1/5/19 | $21 | $93  
1 | 1/6/19 | $31 | $124  
1 | 1/7/19 | $41 | $165  
2 | 1/1/19 | $12 | $12  
2 | 1/2/19 | $22 | $34  
2 | 1/3/19 | $32 | $66  
2 | 1/5/19 | $42 | $108  
2 | 1/7/19 | $52 | $160  
2 | 1/9/19 | $62 | $188  
2 | 1/11/19 | $72 | $228  

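I think the problem is in my RANGE BETWEEN clause, because it seems to be picking up a number of preceding rows. I expected it to pick up the data within the preceding number of seconds (604800 seconds is 7 days; 6 days would be 518400).

Is what I am trying to do possible? I cannot simply use the preceding 6 rows, because not every customer shops every day; customer 2 is an example.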
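
To illustrate the difference between counting rows and covering a span of time, here is a minimal sketch that computes both frames side by side. It assumes that Purch_Amount is stored as a number (without the "$") and that Purch_Date is a string matching the pattern 'M/d/yy', as in the sample rows. Both are assumptions, so adjust them to the real table. The value 518400 is 6 × 86400, that is, six days in seconds.

```sql
-- Sketch only: the 'M/d/yy' pattern and a numeric Purch_Amount are assumptions.
SELECT Cust_ID,
       Purch_Date,
       Purch_Amount,
       -- ROWS frame: current purchase plus the 6 purchases before it,
       -- no matter how far apart in time they are
       SUM(Purch_Amount) OVER (
           PARTITION BY Cust_ID
           ORDER BY unix_timestamp(Purch_Date, 'M/d/yy')
           ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
       ) AS Spend_Last_7_Rows,
       -- RANGE frame: every purchase whose sort key (seconds since epoch)
       -- lies within 518400 seconds (6 days) before the current purchase
       SUM(Purch_Amount) OVER (
           PARTITION BY Cust_ID
           ORDER BY unix_timestamp(Purch_Date, 'M/d/yy')
           RANGE BETWEEN 518400 PRECEDING AND CURRENT ROW
       ) AS Spend_Last_7_Days
FROM Spend_Table;
```

On the sample data, for customer 2 on 1/11/19 the ROWS frame should add up all seven purchases ($294), while the RANGE frame should cover only 1/5/19 through 1/11/19 ($228).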


Update: I got my original code to work by changing it to:

SELECT Cust_ID, Purch_Date, Purch_Amount,
       SUM(Purch_Amount) OVER (
           PARTITION BY Cust_ID
           ORDER BY unix_timestamp(Purch_Date, 'MM-dd-yyyy')
           RANGE BETWEEN 604800 PRECEDING AND CURRENT ROW
       ) AS Rolling_Spend
FROM Spend_Table

The key was to specify the date format in the unix_timestamp call.
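
One way to check whether the date format is the culprit is to compare the two forms of unix_timestamp directly. The single-argument form expects strings in the yyyy-MM-dd HH:mm:ss format, so a value in any other layout will not be parsed as intended. The sketch below is only a diagnostic query; the 'M/d/yy' pattern is an assumption that matches the sample rows shown in the question and should be replaced with whatever pattern matches the real column.

```sql
-- Diagnostic sketch: 'M/d/yy' is an assumed pattern based on the sample rows.
SELECT Purch_Date,
       -- single-argument form: expects 'yyyy-MM-dd HH:mm:ss'
       unix_timestamp(Purch_Date)           AS ts_default_format,
       -- two-argument form: parses the string with the given pattern
       unix_timestamp(Purch_Date, 'M/d/yy') AS ts_explicit_format
FROM Spend_Table
LIMIT 10;
```

If ts_default_format does not come back as a sensible epoch value while ts_explicit_format does, the ORDER BY key in the window needs the explicit pattern.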

1 Answer:

Answer 0: (score: 1)

SELECT *, sum(Purch_Amount) OVER (
        PARTITION BY Cust_ID 
        ORDER BY CAST(Purch_Date AS timestamp) 
        RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW
     ) AS cumulativeSum FROM Spend_Table

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics