Hive - same order of records across arrays

Time: 2017-03-14 14:05:33

Tags: hadoop hive

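I have a table containing hourly data. I want to find the count of hours, along with the col1 and col2 values for all hours collected into arrays. Input table: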

+-----+-----+-----+
| hour| col1| col2|
+-----+-----+-----+
| 00  | 0.0 | a   |
| 04  | 0.1 | b   |
| 08  | 0.2 | c   |
| 12  | 0.0 | d   |
+-----+-----+-----+

I use the following query to get the column values as arrays.

Query: select count(hr), map_values(str_to_map(concat_ws(',',collect_set(concat_ws(':',reflect('java.util.UUID','randomUUID'),cast(col1 as string)))))) as col1_arr, map_values(str_to_map(concat_ws(',',collect_set(concat_ws(':',reflect('java.util.UUID','randomUUID'),cast(col2 as string)))))) as col2_arr from table;

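In the output I get, the values in col2_arr are not in the same order as those in col1_arr. Please suggest how to get the values of different columns into arrays/lists in the same order.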

+----------+-----------------+----------+
| count(hr)| col1_arr        | col2_arr |
+----------+-----------------+----------+
| 4        | 0.0,0.1,0.2,0.0 | b,a,c,d  |
+----------+-----------------+----------+

Required output:

+----------+-----------------+----------+
| count(hr)| col1_arr        | col2_arr |
+----------+-----------------+----------+
| 4        | 0.0,0.1,0.2,0.0 | a,b,c,d  |
+----------+-----------------+----------+

Thanks

1 answer:

Answer 0 (score: 0)

select  count(*) as cnt 
       ,concat_ws(',',sort_array(collect_list(hour)))  as hour
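       -- prefix each value with its hour, sort the prefixed strings, then strip the two-character hour and the colon
       ,regexp_replace(concat_ws(',',sort_array(collect_list(concat_ws(':',hour,cast(col1 as string))))),'..:','') as col1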
       ,regexp_replace(concat_ws(',',sort_array(collect_list(concat_ws(':',hour,col2)))),'..:','') as col2

from    mytable
;
+-----+-------------+-------------+---------+
| cnt |    hour     |    col1     |  col2   |
+-----+-------------+-------------+---------+
|   4 | 00,04,08,12 | 0,0.1,0.2,0 | a,b,c,d |
+-----+-------------+-------------+---------+
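
The idea behind the query above (tag each value with its hour, sort the tagged strings, then strip the tag) can be sketched outside Hive. The following is only an illustration in Python, not Hive code. The helper name `ordered_values` and the in-memory `rows` list are hypothetical stand-ins for the table in the question, with the rows deliberately shuffled.

```python
# Rows mirror the question's input table as (hour, col1, col2) strings,
# listed out of order on purpose to mimic an arbitrary collection order.
rows = [
    ("08", "0.2", "c"),
    ("00", "0.0", "a"),
    ("12", "0.0", "d"),
    ("04", "0.1", "b"),
]


def ordered_values(rows, value_index):
    """Hypothetical helper: tag each value with its hour, sort, strip the tag."""
    # Build "hour:value" strings, like concat_ws(':', hour, value) in the query.
    tagged = sorted(f"{row[0]}:{row[value_index]}" for row in rows)
    # Drop everything up to the first colon, like the regexp_replace step.
    return [item.split(":", 1)[1] for item in tagged]


cnt = len(rows)
hours = sorted(row[0] for row in rows)
col1_arr = ordered_values(rows, 1)
col2_arr = ordered_values(rows, 2)

print(cnt, ",".join(hours), ",".join(col1_arr), ",".join(col2_arr))
```

Both lists come out in hour order because they are sorted on the same key. As in the query, this relies on the hours being zero-padded two-character strings, so that string sorting matches chronological order and the '..:' pattern removes exactly the hour prefix.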