In table1, I have customer_id, item_id, and item_rank (an item's rank based on some sales figures). I want to collect a list of items for each customer_id, ordered by item_rank.
customer_id  item_id  item_rank
23           2        3
23           2        3
23           4        2
25           5        1
25           4        2
The output I expect is:
customer_id  item_list
23           4,2
25           5,4
The code I am using is:
SELECT
    customer_id,
    concat_ws(',', collect_list(string(item_id))) AS item_list
FROM
    table1
GROUP BY
    customer_id
ORDER BY
    item_rank
Answer 0 (score: 6)
You can use a subquery to get the (customer_id, item_id, item_rank) result set sorted by item_rank, and then apply collect_set in the outer query.
Query
WITH table1 AS (
    SELECT 23 AS customer_id, 2 AS item_id, 3 AS item_rank UNION ALL
    SELECT 23 AS customer_id, 2 AS item_id, 3 AS item_rank UNION ALL
    SELECT 23 AS customer_id, 4 AS item_id, 2 AS item_rank UNION ALL
    SELECT 25 AS customer_id, 5 AS item_id, 1 AS item_rank UNION ALL
    SELECT 25 AS customer_id, 4 AS item_id, 2 AS item_rank
)
SELECT
    subquery.customer_id,
    collect_set(subquery.item_id) AS item_id_set
FROM (
    SELECT
        table1.customer_id,
        table1.item_id,
        table1.item_rank
    FROM table1
    DISTRIBUTE BY
        table1.customer_id
    SORT BY
        table1.customer_id,
        table1.item_rank
) subquery
GROUP BY
    subquery.customer_id
;
Result
   customer_id  item_id_set
0  23           [4,2]
1  25           [5,4]
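The subquery uses DISTRIBUTE BY to guarantee that all rows for a given customer_id are routed to the same reducer. It then uses SORT BY to sort the rows within each reducer by customer_id and item_rank. I expect this is sufficient for the requirement, since I did not see any requirement for a total ordering of the final result set. (If a total ordering by customer_id is needed, then I think the query would have to use ORDER BY, which would make execution slower.)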
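Internally, the collect_set UDAF uses a Java LinkedHashSet, which is a set that preserves insertion order, so the sort order used in the subquery is carried over into the set produced by the outer query. This can be seen in the Hive codebase.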
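To make the ordering argument concrete, here is a minimal standalone Java sketch. It is not Hive's actual implementation: the class OrderedCollectSketch, the Row type, and the collectOrdered method are hypothetical names made up for illustration. It sorts the sample rows by (customer_id, item_rank), as the subquery does, and then adds each item_id to a per-customer LinkedHashSet, which drops duplicates while keeping first-insertion order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class OrderedCollectSketch {

    // Hypothetical row type mirroring the columns of table1.
    static final class Row {
        final int customerId;
        final int itemId;
        final int itemRank;

        Row(int customerId, int itemId, int itemRank) {
            this.customerId = customerId;
            this.itemId = itemId;
            this.itemRank = itemRank;
        }
    }

    // Sort by (customer_id, item_rank), then collect item_id values into one
    // LinkedHashSet per customer. The set ignores repeated values and iterates
    // in the order in which elements were first added.
    static Map<Integer, Set<Integer>> collectOrdered(List<Row> rows) {
        List<Row> sorted = new ArrayList<>(rows);
        sorted.sort(Comparator.comparingInt((Row r) -> r.customerId)
                              .thenComparingInt(r -> r.itemRank));

        Map<Integer, Set<Integer>> result = new LinkedHashMap<>();
        for (Row r : sorted) {
            result.computeIfAbsent(r.customerId, k -> new LinkedHashSet<>())
                  .add(r.itemId);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Row> table1 = List.of(
            new Row(23, 2, 3),
            new Row(23, 2, 3),
            new Row(23, 4, 2),
            new Row(25, 5, 1),
            new Row(25, 4, 2));

        // prints {23=[4, 2], 25=[5, 4]}
        System.out.println(collectOrdered(table1));
    }
}
```

Swapping LinkedHashSet for a plain HashSet in this sketch would still remove the duplicate, but the iteration order would no longer be tied to the sort order, which is why the choice of set type matters for the argument above.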
Answer 1 (score: 0)
SELECT
    customer_id,
    collect_set(item_id) AS item_list
FROM
    table1
GROUP BY
    customer_id
ORDER BY
    item_rank
Note: collect_list() keeps duplicate values, whereas collect_set() returns only unique values.