Hive column values in an array

Time: 2017-03-06 19:57:20

Tags: hadoop hive

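I have a table with hourly data. I want to get the count of hours and, for each column, the values for all hours collected into an array.

Input table: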

+-----+-----+-----+
| hour| col1| col2|
+-----+-----+-----+
| 00  | 0.0 | a   |
| 04  | 0.1 | b   |
| 08  | 0.2 | c   |
| 12  | 0.0 | d   |
+-----+-----+-----+

As suggested in the solution below, I am using these functions to get the column values into an array:

select count(hr), 
       map_values(str_to_map(concat_ws(
         ',', 
         collect_set(
           concat_ws(':', reflect('java.util.UUID','randomUUID'), cast(col1 as string))
         )
       ))) as col1_arr,
       map_values(str_to_map(concat_ws(
         ',', 
         collect_set(
           concat_ws(':', reflect('java.util.UUID','randomUUID'), cast(col2 as string))
         )
       ))) as col2_arr from table;

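In the output I get, the values in col2_arr are not in the same order as those in col1_arr. Please suggest how to get the values of different columns into arrays/lists in the same order.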

+----------+-----------------+----------+
| count(hr)| col1_arr        | col2_arr | 
+----------+-----------------+----------+
| 4        | 0.0,0.1,0.2,0.0 | b,a,c,d  | 
+----------+-----------------+----------+

Required output:

+----------+-----------------+----------+
| count(hr)| col1_arr        | col2_arr | 
+----------+-----------------+----------+
| 4        | 0.0,0.1,0.2,0.0 | a,b,c,d  | 
+----------+-----------------+----------+

1 answer:

Answer 0: (score: 0)

with    t as 
        (   
            select  inline
                    (
                        array
                        (
                            struct('00',0.0)
                           ,struct('04',0.1)
                           ,struct('08',0.2)
                           ,struct('12',0.0)
                        )
                    ) as (hour,col1)
        )

select  count(*),collect_list(col1),max(col1)
from    t
;
+-----+-------------------+-----+
| _c0 |        _c1        | _c2 |
+-----+-------------------+-----+
|   4 | [0.0,0.1,0.2,0.0] | 0.2 |
+-----+-------------------+-----+

If you want a guaranteed order of the elements in the array (sort_array sorts them by value, ascending), use:

sort_array(collect_list(col1)) 

If you want to eliminate duplicate elements in the array, use:

collect_set(col1)
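
As a rough sketch, both expressions can be tried against the same inline test data as above. The aliases col1_sorted and col1_distinct are only illustrative names. Note that sort_array orders by value, so the result no longer lines up with the hour column, and the order of the collect_set result should not be relied on.

```sql
-- Same inline test data as in the first query of this answer
with t as (
  select inline(array(
           struct('00', 0.0),
           struct('04', 0.1),
           struct('08', 0.2),
           struct('12', 0.0)
         )) as (hour, col1)
)
select sort_array(collect_list(col1)) as col1_sorted,   -- ascending by value, duplicates kept
       collect_set(col1)              as col1_distinct  -- duplicates removed, order not guaranteed
from t;
```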

To keep duplicate values without collect_list:

with    t as 
        (   
            select  inline
                    (
                        array
                        (
                            struct('00',0.0)
                           ,struct('04',0.0)
                           ,struct('08',0.1)
                           ,struct('12',0.1)
                        )
                    ) as (hour,col1)
        )

select  map_values(str_to_map(concat_ws(',',collect_set(concat_ws(':',reflect('java.util.UUID','randomUUID'),cast(col1 as string))))))
from    t
;
["0.0","0.0","0.1","0.1"]
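
On the original question of keeping col1_arr and col2_arr in the same order: with the UUID approach each column gets its own independently generated random keys, and the order of the resulting map values presumably follows those keys, so two columns have no reason to line up. One possible way around this, sketched below, is to collect each row as a single struct, sort the structs by hour, and then pull the fields out of the sorted array. This assumes a Hive version in which collect_list accepts struct arguments and in which sort_array can compare structs (field by field, starting with the first); it is worth verifying on your version. The names s, cnt, and rows_by_hour are illustrative.

```sql
-- Inline test data mirroring the input table from the question
with t as (
  select inline(array(
           struct('00', 0.0, 'a'),
           struct('04', 0.1, 'b'),
           struct('08', 0.2, 'c'),
           struct('12', 0.0, 'd')
         )) as (hour, col1, col2)
),
s as (
  select count(*) as cnt,
         -- one struct per row; sorting compares hour first because it is the first field
         sort_array(
           collect_list(named_struct('hour', hour, 'col1', col1, 'col2', col2))
         ) as rows_by_hour
  from t
)
select cnt,
       rows_by_hour.col1 as col1_arr,  -- field access on array<struct> yields an array of that field
       rows_by_hour.col2 as col2_arr   -- same positions as col1_arr, both ordered by hour
from s;
```

Because both arrays are extracted from the same sorted array of structs, the element at a given position in col1_arr and col2_arr comes from the same input row.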