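I have a table containing hourly data. I want to get the count of hours and, for each column, the values for all hours collected into an array. Input table: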
+-----+-----+-----+
| hour| col1| col2|
+-----+-----+-----+
| 00 | 0.0 | a |
| 04 | 0.1 | b |
| 08 | 0.2 | c |
| 12 | 0.0 | d |
+-----+-----+-----+
As suggested in the solution below, I am using these functions to collect the column values into arrays:
select count(hr),
       -- a random UUID key makes every element unique, so collect_set keeps duplicate values
       map_values(str_to_map(concat_ws(
           ',',
           collect_set(
               concat_ws(':', reflect('java.util.UUID','randomUUID'), cast(col1 as string))
           )
       ))) as col1_arr,
       map_values(str_to_map(concat_ws(
           ',',
           collect_set(
               concat_ws(':', reflect('java.util.UUID','randomUUID'), cast(col2 as string))
           )
       ))) as col2_arr
from table;
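In the output I get, the values in col2_arr are not in the same order as those in col1_arr. Please suggest how to get the values of different columns into arrays/lists in the same order.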
+----------+-----------------+----------+
| count(hr)| col1_arr | col2_arr |
+----------+-----------------+----------+
| 4 | 0.0,0.1,0.2,0.0 | b,a,c,d |
+----------+-----------------+----------+
Required output:
+----------+-----------------+----------+
| count(hr)| col1_arr | col2_arr |
+----------+-----------------+----------+
| 4 | 0.0,0.1,0.2,0.0 | a,b,c,d |
+----------+-----------------+----------+
Answer 0 (score: 0)
with t as
(
select inline
(
array
(
struct('00',0.0)
,struct('04',0.1)
,struct('08',0.2)
,struct('12',0.0)
)
) as (hour,col1)
)
select count(*),collect_list(col1),max(col1)
from t
;
+-----+-------------------+-----+
| _c0 | _c1 | _c2 |
+-----+-------------------+-----+
| 4 | [0.0,0.1,0.2,0.0] | 0.2 |
+-----+-------------------+-----+
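If you want a guaranteed order of the elements in the array, use the following (note that sort_array sorts the values themselves in ascending order, not by hour):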
sort_array(collect_list(col1))
If you want to eliminate duplicate elements in the array, use:
collect_set(col1)
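Applied to both columns from the question, a minimal sketch might look like the one below. It assumes the question's sample data, with col2 added to the inline table, and the aliases cnt, col1_arr and col2_arr are chosen only for illustration. Both collect_list calls consume the same rows in the same pass, so elements at the same index typically come from the same row; Hive does not document this as a guarantee, so treat it as an observation rather than a contract.

```sql
-- Sketch only: sample rows copied from the question, col2 added for illustration
with t as
(
    select  inline
            (
                array
                (
                    struct('00',0.0,'a')
                   ,struct('04',0.1,'b')
                   ,struct('08',0.2,'c')
                   ,struct('12',0.0,'d')
                )
            ) as (hour,col1,col2)
)
select  count(*)           as cnt
       ,collect_list(col1) as col1_arr   -- keeps duplicates, unlike collect_set
       ,collect_list(col2) as col2_arr   -- built from the same row stream as col1_arr
from    t
;
```

Because collect_list keeps duplicates, the random-UUID workaround is not needed here.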
To keep duplicate values without using collect_list:
with t as
(
select inline
(
array
(
struct('00',0.0)
,struct('04',0.0)
,struct('08',0.1)
,struct('12',0.1)
)
) as (hour,col1)
)
select map_values(str_to_map(concat_ws(',',collect_set(concat_ws(':',reflect('java.util.UUID','randomUUID'),cast(col1 as string))))))
from t
;
["0.0","0.0","0.1","0.1"]
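If the order must be deterministic and tied to the hour, one possible approach is to prefix each value with its hour, sort the prefixed strings, and then strip the prefix. The sketch below is an illustration under several assumptions: hour is a zero-padded string (so string order equals chronological order), hours are unique, and the values contain neither ':' nor ','. The sample rows and the aliases are again taken from the question or invented for illustration.

```sql
-- Sketch only: order both columns by hour via a sortable "hour:value" prefix
with t as
(
    select  inline
            (
                array
                (
                    struct('00',0.0,'a')
                   ,struct('04',0.1,'b')
                   ,struct('08',0.2,'c')
                   ,struct('12',0.0,'d')
                )
            ) as (hour,col1,col2)
)
select  count(*) as cnt
        -- 1. build "hour:value" strings, 2. sort them, 3. join with ',',
        -- 4. remove every "hour:" prefix while keeping the separating comma
       ,regexp_replace
        (
            concat_ws(',',sort_array(collect_list(concat_ws(':',hour,cast(col1 as string)))))
           ,'(^|,)[^:,]*:'
           ,'$1'
        ) as col1_arr
       ,regexp_replace
        (
            concat_ws(',',sort_array(collect_list(concat_ws(':',hour,col2))))
           ,'(^|,)[^:,]*:'
           ,'$1'
        ) as col2_arr
from    t
;
```

Each result column is a comma-separated string ordered by hour; wrapping the expression in split(..., ',') turns it back into an array if one is needed.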