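I'd like to know whether there are any tricks for optimizing the following HiveQL (or SQL in general, out of curiosity).
For example, suppose I have a table: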
x | y | e | time
2 | 5 | 1 | 11:30:00
2 | 1 | 1 | 12:15:00
8 | 0 | 1 | 16:00:00
10 | 6 | 2 | 16:06:00
1 | 2 | 2 | 17:00:00
I want to compute several aggregates:
select
  e,
  time,
  sum(x) over w as cumu_x,
  sum(y) over w as cumu_y,
  count(x) over w as num_x
from my_table
window w as
  (partition by e
   order by time
   rows between unbounded preceding and current row)
This should give me the result I want:
e | time | cumu_x | cumu_y | num_x
1 | 11:30:00 | 2 | 5 | 1
1 | 12:15:00 | 4 | 6 | 2
1 | 16:00:00 | 12 | 6 | 3
2 | 16:06:00 | 10 | 6 | 1
2 | 17:00:00 | 11 | 8 | 2
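
To make the window semantics concrete, here is a minimal Python sketch that reproduces the expected rows from the sample data. It only models what the query computes (partition by e, order by time, running totals up to the current row); it says nothing about how Hive executes it. The function name `running_aggregates` and the tuple layout are assumptions made for illustration.

```python
from itertools import groupby
from operator import itemgetter

# Sample rows from the table above, as (x, y, e, time) tuples.
rows = [
    (2, 5, 1, "11:30:00"),
    (2, 1, 1, "12:15:00"),
    (8, 0, 1, "16:00:00"),
    (10, 6, 2, "16:06:00"),
    (1, 2, 2, "17:00:00"),
]


def running_aggregates(rows):
    """Return (e, time, cumu_x, cumu_y, num_x) for each row.

    Mirrors: partition by e, order by time,
    rows between unbounded preceding and current row.
    Assumes x and y are never NULL/None, as in the sample data.
    """
    out = []
    # Zero-padded HH:MM:SS strings sort correctly as plain text.
    ordered = sorted(rows, key=lambda r: (r[2], r[3]))
    for e, group in groupby(ordered, key=itemgetter(2)):
        cumu_x = cumu_y = num_x = 0  # reset the totals for each partition
        for x, y, _, time in group:
            cumu_x += x
            cumu_y += y
            num_x += 1
            out.append((e, time, cumu_x, cumu_y, num_x))
    return out


result = running_aggregates(rows)
for row in result:
    print(row)
```

All three aggregates share one window, so a single sorted pass per partition is enough to produce them.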
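Question: how can this be optimized? With millions of rows, this kind of Hive query is very slow.
If I were looping over the data myself, I would: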