Hadoop / Hive query to split one column into several columns

Time: 2011-11-07 16:53:12

Tags: database hadoop hive

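I'm using Hive with (more or less) two tables:

- TABLE1 is defined as [(Variables:string), (Value1:int), (Value2:int)]

The field "Variables" looks like "x0,x1,x2,x3,...,xn"

- TABLE2 is defined as [(Value1Sum:int), (Value2Sum:int), (X1:string), (X4:string), (X17:string)]

I'm "converting" table1 into table2 with the following query: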

INSERT OVERWRITE TABLE table2
    SELECT sum(v1), sum(v2), x1, x4, x17
        FROM (SELECT
                Value1 as v1,
                Value2 as v2,
                split(Variables, ",")[1] as x1,
                split(Variables, ",")[4] as x4,
                split(Variables, ",")[17] as x17 
              FROM Table1) tmp
        GROUP BY tmp.x1, tmp.x4, tmp.x17

Will Hive call the split function 3 times?

Is there a way to make it more elegant?

Is there a way to make it more generic?

Regards, CC

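1 Answer:

Answer 0 (score: 3)

Yes, split will be called each time it appears in the query. You can make it more elegant, though:

Why not define Variables as an array column? Then you can access the elements directly: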

select Variables[1] from table1

I assume you're using an external table, so you could do something like this:

create external table table1(variables array<string>, a int, b int)
ROW FORMAT DELIMITED
    COLLECTION ITEMS TERMINATED BY ','
LOCATION 'hdfs://somewhere'
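
To tie this back to the question, here is a rough sketch of how the original INSERT might look against the array-typed table. It assumes the column names from the DDL above (variables, a, b) rather than Value1/Value2, and that the data files separate the three fields with whatever field delimiter the table expects. The second statement is a hypothetical variant for when the table cannot be redefined: it calls split in one place in an inner subquery and indexes the result outside. Whether Hive actually evaluates split only once in that form depends on the version and optimizer, so treat it as a readability improvement rather than a guaranteed speedup.

```sql
-- Sketch 1: table1 has been redefined with variables as array<string>,
-- so no split() call is needed at query time.
INSERT OVERWRITE TABLE table2
SELECT sum(v1), sum(v2), x1, x4, x17
FROM (SELECT
        a AS v1,
        b AS v2,
        variables[1]  AS x1,
        variables[4]  AS x4,
        variables[17] AS x17
      FROM table1) tmp
GROUP BY tmp.x1, tmp.x4, tmp.x17;

-- Sketch 2 (assumption: table1 still has the original string column):
-- split() is written once, and the array it returns is indexed one level up.
INSERT OVERWRITE TABLE table2
SELECT sum(v1), sum(v2), x1, x4, x17
FROM (SELECT
        v1,
        v2,
        vars[1]  AS x1,
        vars[4]  AS x4,
        vars[17] AS x17
      FROM (SELECT
              Value1 AS v1,
              Value2 AS v2,
              split(Variables, ',') AS vars
            FROM Table1) s) tmp
GROUP BY tmp.x1, tmp.x4, tmp.x17;
```

Array indexes in Hive are zero-based, so variables[1] picks x1 from "x0,x1,x2,...", matching the indexes used in the question.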