Sorting within collect_list() in Hive

Time: 2018-06-08 18:47:27

Tags: hive hiveql

Let's say I have a Hive table that looks like this:

ID    event    order_num
------------------------
A      red         2
A      blue        1
A      yellow      3
B      yellow      2
B      green       1
...

I am trying to use collect_list to generate a list of events for each ID, like this:

SELECT ID, 
collect_list(event) as events_list
FROM table
GROUP BY ID;

However, within each ID that I group by, I need the events sorted by order_num, so that my result table looks like this:

ID    events_list
------------------------
A      ["blue","red","yellow"]
B      ["green","yellow"]

I can't do a global sort by ID and order_num before the collect_list() query, because the table is huge. Is there a way to sort by order_num within collect_list?

Thanks!

3 Answers:

Answer 0: (score: 2)

So, I found the answer here. The trick is to use a subquery with DISTRIBUTE BY and SORT BY clauses. See below:

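A minimal sketch of the DISTRIBUTE BY / SORT BY subquery this answer describes, using the columns from the question. The table name events_table is a hypothetical placeholder, and the alias t is only for illustration. Note that this relies on collect_list seeing rows in the order the subquery emits them, which Hive does not strictly guarantee, so treat it as something to verify on your own data rather than a definitive solution.

```sql
-- Sketch only: events_table is a hypothetical name for the table in the question.
SELECT t.ID,
       collect_list(t.event) AS events_list
FROM (
    SELECT ID, event, order_num
    FROM events_table
    DISTRIBUTE BY ID          -- route all rows for a given ID to the same reducer
    SORT BY ID, order_num     -- sort within each reducer only, not globally
) t
GROUP BY t.ID;
```

Because SORT BY orders rows per reducer rather than across the whole table, this avoids the single-reducer global sort that ORDER BY would trigger.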

Answer 1: (score: 0)

The sort_array() function should sort the output of collect_list():

select ID, sort_array(collect_list(event)) as events_list
from table
group by ID;
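Keep in mind that sort_array(collect_list(event)) sorts by the event values themselves (alphabetically), not by order_num; it matches the desired output on the sample rows only by coincidence. One possible variation, sketched below, collects (order_num, event) pairs as structs, sorts those, and then pulls out the event field. This assumes a Hive version whose sort_array accepts arrays of structs and which allows field access on an array of structs; events_table, sorted_pairs, and t are hypothetical names.

```sql
-- Sketch only: assumes sort_array supports array<struct> in your Hive version.
SELECT t.ID,
       t.sorted_pairs.event AS events_list   -- field access on array<struct> yields an array of events
FROM (
    SELECT ID,
           -- order_num comes first in the struct so it drives the sort order
           sort_array(
               collect_list(named_struct('order_num', order_num, 'event', event))
           ) AS sorted_pairs
    FROM events_table
    GROUP BY ID
) t;
```

The appeal of this form is that the ordering is computed inside each group, so it does not depend on the order in which rows reach the reducer.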

Answer 2: (score: 0)

Try the following:

WITH tmp AS (
  SELECT * FROM data DISTRIBUTE BY ID SORT BY ID, order_num desc
)
SELECT ID, collect_list(event)
FROM tmp
GROUP BY ID