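Suppose I have a Hive table with 3 columns: merchant_id, week_id, acc_id. For each week, my goal is to collect the unique customers seen in the previous 4 weeks, and I am using a moving window to do this.
My code:
Create the test table: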
CREATE TABLE table_test_test (merchant_id INT, week_id INT, acc_id INT);
INSERT INTO TABLE table_test_test VALUES
(1,0,8),
(1,0,9),
(1,0,10),
(1,2,1),
(1,2,2),
(1,2,4),
(1,4,1),
(1,4,3),
(1,4,4),
(1,5,1),
(1,5,3),
(1,5,5),
(1,6,1),
(1,6,5),
(1,6,6);
Then collect:
select
merchant_id,
week_id,
collect_set(acc_id) over (partition by merchant_id ORDER BY week_id RANGE BETWEEN 4 preceding AND 0 preceding) as uniq_accs_prev_4_weeks
from
table_test_test
The resulting table is:
row  merchant_id  week_id  uniq_accs_prev_4_weeks
1    1            0        []
2    1            0        []
3    1            0        []
4    1            2        [9,8,10]
5    1            2        [9,8,10]
6    1            2        [9,8,10]
7    1            4        [9,8,10,1,2,4]
8    1            4        [9,8,10,1,2,4]
9    1            4        [9,8,10,1,2,4]
10   1            5        [1,2,4,3]
11   1            5        [1,2,4,3]
12   1            5        [1,2,4,3]
13   1            6        [1,2,4,3,5]
14   1            6        [1,2,4,3,5]
15   1            6        [1,2,4,3,5]
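As you can see, the table contains redundant rows: the window function emits one output row per input row, so every (merchant_id, week_id) pair is repeated. This is just an example; in my real case the table is very large and the redundancy causes memory problems.
I have tried using distinct and group by, but neither worked.
Is there a good way to do this? Thanks a lot.
Answer 0: (score: 0)
This works well: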
select distinct merchant_id, week_id, uniq_accs_prev_4_weeks
from
(
select
merchant_id,
week_id,
collect_set(acc_id) over (partition by merchant_id ORDER BY week_id RANGE BETWEEN 4 preceding AND current row) as uniq_accs_prev_4_weeks
from
table_test_test
)s;
Result:
OK
1 0 [9,8,10]
1 2 [9,8,10,1,2,4]
1 4 [9,8,10,1,2,4,3]
1 5 [1,2,4,3,5]
1 6 [1,2,4,3,5,6]
Time taken: 98.088 seconds, Fetched: 5 row(s)
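My Hive does not accept 0 preceding, so I replaced it with current row. It looks like this bug or this bug; my Hive version is 1.2. Your original query should work fine once you wrap it in an outer select distinct over the subquery, as above.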
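If the window function itself is the memory bottleneck, one possible alternative (not what the answer above uses, just a sketch under assumptions) is to skip the window entirely and aggregate with a self-join plus group by. It assumes the same table_test_test table and an inclusive range of week_id - 4 through week_id, matching RANGE BETWEEN 4 preceding AND current row. The range condition is placed in WHERE rather than ON because older Hive versions such as 1.2 only accept equality conditions in a join's ON clause.

```sql
-- Sketch: one output row per (merchant_id, week_id), no window function.
-- Alias names w and t are illustrative only.
SELECT
  w.merchant_id,
  w.week_id,
  collect_set(t.acc_id) AS uniq_accs_prev_4_weeks
FROM
  -- distinct (merchant, week) pairs that need a result row
  (SELECT DISTINCT merchant_id, week_id FROM table_test_test) w
JOIN
  table_test_test t
ON
  w.merchant_id = t.merchant_id
WHERE
  -- keep only rows from the current week and the 4 weeks before it
  t.week_id BETWEEN w.week_id - 4 AND w.week_id
GROUP BY
  w.merchant_id,
  w.week_id;
```

On the sample data this should return five rows, one for each of weeks 0, 2, 4, 5 and 6. The trade-off is that the join pairs each week with every row in its 5-week range, so whether it is cheaper than the windowed query on a large table depends on the data and is worth testing before relying on it.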