How to use ORDER BY with the collect_set() operation in Hive

Time: 2017-07-13 23:51:05

Tags: sql hive

In table1 I have customer_id, item_id, and item_rank (the rank of an item based on some sales figures). I want to collect the list of items for each customer_id, ordered by item_rank.

customer_id  item_id item_rank
  23            2      3
  23            2      3
  23            4      2
  25            5      1
  25            4      2

The output I expect is

Customer_id    item_list
  23             4,2
  25             5,4

The code I am using is

 SELECT
    customer_id,
    concat_ws(',',collect_list (string(item_id))) AS item_list
FROM
    table1
GROUP BY
    customer_id
ORDER BY
    item_rank

2 answers:

Answer 0 (score: 6)

You can use a subquery to produce a result set of (customer_id, item_id, item_rank) sorted by item_rank, and then apply collect_set in the outer query.

Query

WITH table1 AS (
    SELECT 23 AS customer_id, 2 AS item_id, 3 AS item_rank UNION ALL
    SELECT 23 AS customer_id, 2 AS item_id, 3 AS item_rank UNION ALL
    SELECT 23 AS customer_id, 4 AS item_id, 2 AS item_rank UNION ALL
    SELECT 25 AS customer_id, 5 AS item_id, 1 AS item_rank UNION ALL
    SELECT 25 AS customer_id, 4 AS item_id, 2 AS item_rank
)
SELECT
    subquery.customer_id,
    collect_set(subquery.item_id) AS item_id_set
FROM (
    SELECT
        table1.customer_id,
        table1.item_id,
        table1.item_rank
    FROM table1
    DISTRIBUTE BY
        table1.customer_id
    SORT BY
        table1.customer_id,
        table1.item_rank
) subquery
GROUP BY
    subquery.customer_id
;

Result

    customer_id item_id_set
0   23  [4,2]
1   25  [5,4]

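The subquery uses DISTRIBUTE BY to guarantee that all rows for a given customer_id are routed to the same reducer. It then uses SORT BY to sort the rows within each reducer by customer_id and item_rank. I hope this is sufficient for the requirement, since I did not notice any need for a total ordering of the final result set. (If a total ordering by customer_id is required, then I think the query would have to use ORDER BY, which would make execution slower.)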

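Internally, the collect_set UDAF uses a Java LinkedHashSet, a set that preserves insertion order, so the sort order established in the subquery is carried over into the set built by the outer query. This can be seen in the Hive codebase: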

https://github.com/apache/hive/blob/release-2.0.0/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFMkCollectionEvaluator.java#L93
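
The question asked for a comma-separated string rather than an array. A minimal sketch of how that might be obtained, assuming the ordering behaviour described above holds for your Hive version and execution settings (it is an implementation detail, not a documented guarantee): cast each item_id to a string and wrap the aggregate in concat_ws, as the question's own query does. The subquery alias `sorted_rows` is made up for illustration.

```sql
-- Sample data copied from the answer's query.
WITH table1 AS (
    SELECT 23 AS customer_id, 2 AS item_id, 3 AS item_rank UNION ALL
    SELECT 23 AS customer_id, 2 AS item_id, 3 AS item_rank UNION ALL
    SELECT 23 AS customer_id, 4 AS item_id, 2 AS item_rank UNION ALL
    SELECT 25 AS customer_id, 5 AS item_id, 1 AS item_rank UNION ALL
    SELECT 25 AS customer_id, 4 AS item_id, 2 AS item_rank
)
SELECT
    sorted_rows.customer_id,
    -- concat_ws accepts an array<string>, so cast item_id before collecting.
    concat_ws(',', collect_set(CAST(sorted_rows.item_id AS STRING))) AS item_list
FROM (
    SELECT
        customer_id,
        item_id,
        item_rank
    FROM table1
    -- Send all rows of one customer to the same reducer ...
    DISTRIBUTE BY customer_id
    -- ... and sort them by rank within that reducer.
    SORT BY customer_id, item_rank
) sorted_rows
GROUP BY
    sorted_rows.customer_id
;
```

If the order of elements inside item_list matters for correctness, it is worth checking the result on real data, because nothing in the query itself forces the aggregate to see the rows in sorted order.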

Answer 1 (score: 0)

SELECT
    customer_id,
    collect_set(item_id) AS item_list
FROM
    table1
GROUP BY
    customer_id
ORDER BY
    item_rank

Note: collect_list() keeps duplicates, whereas collect_set() returns only unique values.
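
A small sketch of that difference on the question's sample data. The aliases `with_duplicates` and `distinct_only` are made up for illustration, and since this query does no sorting, the order of elements within each array should not be relied on.

```sql
-- Sample data copied from the accepted answer's query.
WITH table1 AS (
    SELECT 23 AS customer_id, 2 AS item_id, 3 AS item_rank UNION ALL
    SELECT 23 AS customer_id, 2 AS item_id, 3 AS item_rank UNION ALL
    SELECT 23 AS customer_id, 4 AS item_id, 2 AS item_rank UNION ALL
    SELECT 25 AS customer_id, 5 AS item_id, 1 AS item_rank UNION ALL
    SELECT 25 AS customer_id, 4 AS item_id, 2 AS item_rank
)
SELECT
    customer_id,
    collect_list(item_id) AS with_duplicates,  -- one element per input row
    collect_set(item_id)  AS distinct_only     -- each distinct item_id once
FROM table1
GROUP BY
    customer_id
;
```

For customer 23, whose sample rows contain item 2 twice, collect_list returns three elements (2 twice and 4 once), while collect_set returns only 2 and 4.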