I'm working in Hive, and I have a dataset like this:
client_id date nb_pts
1 2016-06-01 1
1 2016-06-02 3
1 2016-06-03 4
2 2016-06-01 2
2 2016-06-02 3
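For each client, I need to output the difference between the current nb_pts and the previous nb_pts (treating the previous value as 0 on the client's first row). So my output should be: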
client_id date nb_pts nb_pts_per_row
1 2016-06-01 1 1 (1-0)
1 2016-06-02 3 2 (3-1)
1 2016-06-03 4 1 (4-3)
2 2016-06-01 2 2 (2-0)
2 2016-06-02 3 1 (3-2)
I tried using the LAG function in Hive:
SELECT client_id, date, nb_pts,
nb_pts - (LAG(nb_pts, 1, 0) OVER (PARTITION BY client_id ORDER BY date ROWS 1 PRECEDING)) as nb_pts_per_row
FROM MyTable
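But validation failed. It says:
Failed to breakup Windowing invocations into Groups. At least 1 group must only depend on input columns. Also check for circular dependencies. Underlying error: Expecting left window frame boundary for function LAG((TOK_TABLE_OR_COL nb_pts), 1, 0) org.apache.hadoop.hive.ql.parse.WindowingSpec$WindowSpec@27a007cd as LAG_window_0 to be unbounded.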
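EDIT (solution):
It works without ROWS 1 PRECEDING. LAG already takes its offset as an argument (the 1 in LAG(nb_pts, 1, 0)), and Hive does not accept a window frame clause for LAG: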
SELECT client_id, date, nb_pts,
nb_pts - (LAG(nb_pts, 1, 0) OVER (PARTITION BY client_id ORDER BY date)) as nb_pts_per_row
FROM MyTable
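
For anyone who wants to check the expected numbers without a Hive cluster, here is a minimal sketch that runs the same query against the sample rows using Python's built-in sqlite3 module. It assumes the bundled SQLite is version 3.25 or newer (the first release with window functions). The helper name nb_pts_per_row and the trailing ORDER BY are my own additions for illustration, and SQLite is only a stand-in here, not Hive.

```python
import sqlite3

# Sample rows copied from the question: (client_id, date, nb_pts).
rows = [
    (1, "2016-06-01", 1),
    (1, "2016-06-02", 3),
    (1, "2016-06-03", 4),
    (2, "2016-06-01", 2),
    (2, "2016-06-02", 3),
]

# The working query from the edit above. The final ORDER BY is an addition
# so that the result order is deterministic for the check.
QUERY = """
SELECT client_id, date, nb_pts,
       nb_pts - LAG(nb_pts, 1, 0) OVER (PARTITION BY client_id ORDER BY date)
           AS nb_pts_per_row
FROM MyTable
ORDER BY client_id, date
"""


def nb_pts_per_row(data):
    """Load the rows into an in-memory SQLite table and run the LAG query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE MyTable (client_id INTEGER, date TEXT, nb_pts INTEGER)"
        )
        conn.executemany("INSERT INTO MyTable VALUES (?, ?, ?)", data)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


result = nb_pts_per_row(rows)
for row in result:
    print(row)
```

The last column of each printed row should match the nb_pts_per_row column in the expected output above (1, 2, 1 for client 1 and 2, 1 for client 2).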