I have a row in the following format:
I want the output in this format:
Col1 col2 col3
1 (1,2,3) (A,B,C)
I cannot get the data into this format using explode.
Answer 0 (score: 0)
You can try this approach:
Register the Hive UDF
Locate the hive-contrib-<version>.jar for your distribution and add it to Hive so that it is available to all nodes. The path below is from the Hortonworks HDP distribution.
hive (default)> add jar /usr/hdp/current/hive-client/lib/hive-contrib-1.2.1000.2.5.3.0-37.jar;
Register the row_sequence() UDF.
hive (default)> create temporary function row_sequence as 'org.apache.hadoop.hive.contrib.udf.UDFRowSequence';
Run the query
SELECT e.col1 col1, f.col2 col2, f.col3 col3
FROM
  hive_tbl e,
  (
    SELECT c.col2, d.col3
    FROM
      (
        SELECT row_sequence() entryseq, a.col2
        FROM
          (
            SELECT EXPLODE(col2) col2
            FROM hive_tbl
          ) a
      ) c
    JOIN
      (
        SELECT row_sequence() entryseq, b.col3
        FROM
          (
            SELECT EXPLODE(col3) col3
            FROM hive_tbl
          ) b
      ) d
    ON
      c.entryseq = d.entryseq
  ) f;
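To see why the join lines the elements up, it can help to run one of the numbered subqueries on its own. This is only a sketch against the same hive_tbl. It assumes row_sequence is already registered as shown above and that the table holds a single row, because the counter does not restart for each source row.

```sql
-- Number the exploded elements of col2 with a running sequence value.
-- Assumes row_sequence() has been registered from hive-contrib.
SELECT row_sequence() AS entryseq, a.col2
FROM (
  SELECT EXPLODE(col2) AS col2
  FROM hive_tbl
) a;
```

Running the same statement with col3 in place of col2 gives the second numbered list. The outer JOIN then matches rows whose entryseq values are equal.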
Illustration:
Data file
-- testarray.txt
1|1,2,3|A,B,C
Load the file into HDFS
hadoop fs -mkdir /hive-data/array
hadoop fs -put testarray.txt /hive-data/array
Create and verify the Hive table
CREATE EXTERNAL TABLE `default.hive_tbl`(
col1 string,
col2 array<string>,
col3 array<string>
)
row format delimited
fields terminated by '|'
collection items terminated by ','
lines terminated by '\n'
LOCATION 'hdfs:///hive-data/array';
select * from hive_tbl;
hive_tbl.col1 hive_tbl.col2 hive_tbl.col3
1 ["1","2","3"] ["A","B","C"]
Run the query
-- query output
col1 col2 col3
1 1 A
1 2 B
1 3 C
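As a possible alternative that avoids the contrib jar, Hive 0.13 and later include a built-in posexplode() table function that returns each array element together with its index. This is a different technique from the row_sequence() approach above, sketched here against the same hive_tbl. The aliases t, p2, p3, pos2, pos3, val2 and val3 are illustrative names, not part of the original answer.

```sql
-- Pair col2 and col3 elements by their array index within each source row.
SELECT t.col1,
       p2.val2 AS col2,
       p3.val3 AS col3
FROM hive_tbl t
LATERAL VIEW posexplode(t.col2) p2 AS pos2, val2
LATERAL VIEW posexplode(t.col3) p3 AS pos3, val3
WHERE p2.pos2 = p3.pos3;
```

Because the index is computed inside each row, this form should also keep the pairs aligned when the table contains more than one row.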