Hive - Create a map column type by aggregating values across groups

Time: 2017-07-31 21:17:10

Tags: sql hadoop types hive collect

I have a table that looks like this:

|customer|category|room|date|
-----------------------------
|1       |   A    | aa | d1 |
|1       |   A    | bb | d2 |
|1       |   B    | cc | d3 |
|1       |   C    | aa | d1 |
|1       |   C    | bb | d2 |
|2       |   A    | aa | d3 |
|2       |   A    | bb | d4 |
|2       |   C    | bb | d4 |
|2       |   C    | ee | d5 |
|3       |   D    | ee | d6 |

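I want to create two maps from the table:

1. map_customer_room_date: groups by customer and collects all the distinct rooms (keys) and dates (values).

I am using the collect() UDF from Brickhouse.

This can be achieved with something like the following: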

select customer, collect(room,date) as map_customer_room_date
from table
group by customer

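2. map_category_room_date is a bit more complex. It has the same map type, collect(room, date), but its keys should be all the rooms in all the categories that customer X belongs to. This means that for customer 1 it will include room ee even though that room belongs to customer 2, because customer 1 has category C and this category is also present for customer 2.

The final table is grouped by customer and looks like this: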

|customer| map_customer_room_date  |     map_category_room_date    |
-------------------------------------------------------------------|
|   1    |{aa: d1, bb: d2, cc: d3} |{aa: d1, bb: d2, cc: d3,ee: d5}|
|   2    |{aa: d3, bb: d4, ee: d5} |{aa: d3, bb: d4, ee: d5}       |
|   3    |{ee: d6}                 |{ee: d6}                       |  

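I am having trouble building the second map and presenting the final table as described. Any idea how this can be achieved?

1 Answer:

Answer 0: (score: 1)

You can use a series of self joins to find the other rooms in the same categories before collecting the results into the two maps.

Code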

CREATE TABLE `table` AS
SELECT 1 AS customer, 'A' AS category, 'aa' AS room, 'd1' AS `date` UNION ALL
SELECT 1 AS customer, 'A' AS category, 'bb' AS room, 'd2' AS `date` UNION ALL
SELECT 1 AS customer, 'B' AS category, 'cc' AS room, 'd3' AS `date` UNION ALL
SELECT 1 AS customer, 'C' AS category, 'aa' AS room, 'd1' AS `date` UNION ALL
SELECT 1 AS customer, 'C' AS category, 'bb' AS room, 'd2' AS `date` UNION ALL
SELECT 2 AS customer, 'A' AS category, 'aa' AS room, 'd3' AS `date` UNION ALL
SELECT 2 AS customer, 'A' AS category, 'bb' AS room, 'd4' AS `date` UNION ALL
SELECT 2 AS customer, 'C' AS category, 'bb' AS room, 'd4' AS `date` UNION ALL
SELECT 2 AS customer, 'C' AS category, 'ee' AS room, 'd5' AS `date` UNION ALL
SELECT 3 AS customer, 'D' AS category, 'ee' AS room, 'd6' AS `date`
;


SELECT
    customer_rooms.customer,
    collect(customer_rooms.room, customer_rooms.date) AS map_customer_room_date,
    collect(
        COALESCE(customer_category_rooms.room, category_rooms.room),
        COALESCE(customer_category_rooms.date, category_rooms.date)) AS map_category_room_date
FROM `table` AS customer_rooms
JOIN `table` AS category_rooms ON customer_rooms.category = category_rooms.category
LEFT OUTER JOIN `table` AS customer_category_rooms ON customer_rooms.customer = customer_category_rooms.customer
AND category_rooms.category = customer_category_rooms.category
AND category_rooms.room = customer_category_rooms.room
WHERE (
    customer_rooms.customer = customer_category_rooms.customer AND
    customer_rooms.category = customer_category_rooms.category AND
    customer_rooms.room = customer_category_rooms.room AND
    customer_rooms.date = customer_category_rooms.date
)
OR (
    customer_category_rooms.customer IS NULL AND
    customer_category_rooms.category IS NULL AND
    customer_category_rooms.room IS NULL AND
    customer_category_rooms.date IS NULL
)
GROUP BY
    customer_rooms.customer
;

Result set

1   {"aa":"d1","bb":"d2","cc":"d3"} {"aa":"d1","bb":"d2","cc":"d3","ee":"d5"}
2   {"aa":"d3","bb":"d4","ee":"d5"} {"aa":"d3","bb":"d4","ee":"d5"}
3   {"ee":"d6"} {"ee":"d6"}

Explanation

FROM `table` AS customer_rooms

First, the results are drawn from the original table. We name this relation customer_rooms. As you already noted in the question, this is sufficient to build map_customer_room_date.

JOIN `table` AS category_rooms ON customer_rooms.category = category_rooms.category

The first self join identifies all rooms that share a category with the rooms explicitly listed in the customer_rooms rows. We name this relation category_rooms.

LEFT OUTER JOIN `table` AS customer_category_rooms ON customer_rooms.customer = customer_category_rooms.customer
AND category_rooms.category = customer_category_rooms.category
AND category_rooms.room = customer_category_rooms.room

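The second self join takes the rooms that we identified in category_rooms and checks whether each room is already held by the customer identified in customer_rooms. We name this relation customer_category_rooms. This is a LEFT OUTER JOIN because we want to preserve all rows from the prior joins. The result will be either 1) customer_category_rooms is populated with the customer's own row, because the customer already holds this room, or 2) the values from customer_category_rooms are all NULL, because the customer does not hold this room but it is a room in one of the same categories. This distinction becomes important so that we can preserve the customer's own date if they already hold the room.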
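To see the two cases side by side, it can help to look at the joined rows before any filtering or aggregation. The following is a minimal sketch that reuses the sample table and the relation names from the answer; the output column aliases (held_room, category_room, category_date, matched_room, matched_date) and the restriction to customer 1 and category C are assumptions made only for illustration.

```sql
-- Inspect the raw join output for one customer and one category.
-- Aliases such as held_room and matched_room are illustrative only.
SELECT
    customer_rooms.customer,
    customer_rooms.category,
    customer_rooms.room            AS held_room,
    category_rooms.room            AS category_room,
    category_rooms.`date`          AS category_date,
    customer_category_rooms.room   AS matched_room,   -- NULL when the customer does not hold category_room
    customer_category_rooms.`date` AS matched_date
FROM `table` AS customer_rooms
JOIN `table` AS category_rooms
    ON customer_rooms.category = category_rooms.category
LEFT OUTER JOIN `table` AS customer_category_rooms
    ON customer_rooms.customer = customer_category_rooms.customer
    AND category_rooms.category = customer_category_rooms.category
    AND category_rooms.room = customer_category_rooms.room
WHERE customer_rooms.customer = 1
    AND customer_rooms.category = 'C'
;
```

With the sample data, the rows whose category_room is ee should come back with NULL in matched_room and matched_date, because customer 1 holds no ee room in category C. Those are the rows that case 2 describes.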

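Next, we need to filter.

WHERE (
    customer_rooms.customer = customer_category_rooms.customer AND
    customer_rooms.category = customer_category_rooms.category AND
    customer_rooms.room = customer_category_rooms.room AND
    customer_rooms.date = customer_category_rooms.date
)

This includes rooms that the customer explicitly holds in the original table.

OR (
    customer_category_rooms.customer IS NULL AND
    customer_category_rooms.category IS NULL AND
    customer_category_rooms.room IS NULL AND
    customer_category_rooms.date IS NULL
)

This includes rooms that the customer does not hold but that belong to the same categories as rooms the customer holds.

collect(customer_rooms.room, customer_rooms.date) AS map_customer_room_date,

map_customer_room_date can be built by collecting the original data from the table, which we aliased as customer_rooms.

collect(
    COALESCE(customer_category_rooms.room, category_rooms.room),
    COALESCE(customer_category_rooms.date, category_rooms.date)) AS map_category_room_date

Building map_category_room_date is more complex. If the customer explicitly holds a room, then we want to preserve that date. However, if the customer does not explicitly hold a room, then we want to use the room and date from another row that has an overlapping category. To accomplish that, we use the Hive COALESCE function to choose the first value that is not NULL. If the customer already holds the room (as shown by non-NULL values in customer_category_rooms), then we use that. If not, then we use the values from category_rooms instead.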
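The fallback behaviour of COALESCE can be checked in isolation. This is a small sketch, assuming a Hive version that allows SELECT without a FROM clause; the literals 'd1' and 'd5' and the column aliases are placeholders standing in for the customer's own date and a date taken from another customer's row.

```sql
-- COALESCE returns its first non-NULL argument.
SELECT
    -- Customer already holds the room: the customer's own date is kept.
    COALESCE('d1', 'd5')                 AS held_by_customer,
    -- Customer does not hold the room: fall back to the category row's date.
    COALESCE(CAST(NULL AS STRING), 'd5') AS taken_from_category
;
```

The first column evaluates to d1 and the second to d5, which mirrors how the two COALESCE calls choose between customer_category_rooms and category_rooms.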

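Note that there can still be some ambiguity if the same category/room combination can map to multiple date values. If that matters, you may need to put in more work to choose the right date according to some business rule (e.g. use the soonest date), or to map to multiple date values instead of a single value. If there are additional requirements like that, this should give you a good starting point.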
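As one possible way to make the choice deterministic, the dates could be reduced to a single value per customer and room before the map is built. The sketch below applies the "soonest date" rule with MIN. It assumes that date values sort correctly as strings (as d1 to d6 do in the sample data, or as ISO yyyy-MM-dd dates would) and that Brickhouse's collect() is registered as in the query above; the names earliest and first_date are made up for illustration. Note that this is a different rule from the COALESCE approach: it does not favour the customer's own date, so with the sample data customer 2 would get d1 for room aa rather than d3.

```sql
-- Sketch: resolve duplicate dates by taking the earliest date
-- per (customer, room) across all rooms in the customer's categories.
SELECT
    customer,
    collect(room, first_date) AS map_category_room_date
FROM (
    SELECT
        customer_rooms.customer    AS customer,
        category_rooms.room        AS room,
        MIN(category_rooms.`date`) AS first_date   -- "soonest date" business rule
    FROM `table` AS customer_rooms
    JOIN `table` AS category_rooms
        ON customer_rooms.category = category_rooms.category
    GROUP BY
        customer_rooms.customer,
        category_rooms.room
) AS earliest
GROUP BY
    customer
;
```

Because each (customer, room) pair appears exactly once in the inner result, the map no longer depends on the order in which collect() happens to see the rows.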