Comparing array column values with normal column values in Hive

Time: 2015-09-18 06:51:26

Tags: arrays hadoop hive

Table1

Column1 Column2
1       1,2,10
2       11,12,13
3       1,2,14
4       20,1,10
5       11,12,13,14

Table2

Column1 Column2
1       Purchase
2       Product View
10      Cart Open
11      Checkout
12      Cart Add
13      Cart Remove
14      Cart View
20      Campaign View

The result table should look like this:

Column1 Column2     DESC
1       1,2,10      Purchase, Product View, Cart Open
2       11,12,13    Checkout, Cart Add, Cart Remove
3       1,2,14      Purchase, Product View, Cart View
4       20,1,10     Campaign View, Purchase, Cart Open
5       11,12,13,14 Checkout, Cart Add, Cart Remove, Cart View

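Note:

When Table1.column2[0] == table2.column1, the corresponding table2.column2 value should appear in the DESC column of the new result table, and likewise for every other element of the array.

Can we use a join in this query? If so, how can we do it in Hive?

Please help me solve this.

Thanks in advance, Anbu k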

1 Answer:

Answer 0: (score: 0)

Query

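-- Register the Brickhouse jar and its collect UDAF.
add jar /path/to/jars/brickhouse-0.7.1.jar;
create temporary function collect as "brickhouse.udf.collect.CollectUDAF";

-- Explode the array column into one row per element, join each element
-- to its description, then collect the results back into arrays per col1.
select a.col1
  , collect(b.col1)
  , collect(b.col2)
from (
    select col1, exp_col2
    from db.tbl1
    lateral view explode(col2) exptbl as exp_col2 ) a
join db.tbl2 b
on b.col1=a.exp_col2
group by a.col1;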

Output

1       [1, 2, 10]         ["Purchase","Product View","Cart Open"]
2       [11, 12, 13]       ["Checkout","Cart Add","Cart Remove"]
3       [1, 2, 14]         ["Purchase","Product View","Cart View"]
4       [1, 10, 20]        ["Purchase","Cart Open","Campaign View"]
5       [11, 12, 13, 14]   ["Checkout","Cart Add","Cart Remove","Cart View"]

Be sure to use the Brickhouse collect rather than the built-in collect_list(), because the latter does not (necessarily) preserve order.
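
The query above assumes that col2 in db.tbl1 is already an array. If it is instead stored as a comma-separated string, as the question's Table1 displays it, a variation along the following lines might work. This is only a sketch under stated assumptions: col2 is a STRING such as '1,2,10', db.tbl2.col1 is an INT, and the output alias descr is a made-up name (chosen because DESC is a reserved word). It uses the built-in collect_list(), so the order of the descriptions is not guaranteed.

```sql
-- Assumption: db.tbl1.col2 is a STRING like '1,2,10', not an array.
-- Assumption: db.tbl2.col1 is an INT, so each split element is cast before joining.
select a.col1
  , concat_ws(', ', collect_list(b.col2)) as descr  -- order is not guaranteed
from (
    select col1, cast(trim(item) as int) as item
    from db.tbl1
    -- split() turns the string into an array; explode() emits one row per element
    lateral view explode(split(col2, ',')) exptbl as item ) a
join db.tbl2 b
on b.col1 = a.item
group by a.col1;
```

Here concat_ws() joins the collected descriptions into a single comma-separated string, which is closer to the DESC column in the question than an array would be.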