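I have a dataset that uses a double colon (::) as the field delimiter. How can I use the regex SerDe in Hive to parse the data so that it can be loaded into a table?
The data is structured as follows: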
userId::movieId::rating::time
I am currently using the query below, but a select on the resulting table returns only NULL values:
create table rating_regex(userId string,movieId string,rating string,time string) row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' with serdeproperties(
"input.regex" = "::"
) stored as textfile
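Answer 0: (score: 5)
You need to write a single regular expression that matches the entire record, with one capture group per column, and then declare the output format. A pattern that matches only the delimiter, such as "::", never matches a whole row, which is why every column comes back NULL.
Example: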
CREATE TABLE rating_regex(
userId string,
movieId string,
rating string,
time string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH serdeproperties("input.regex" = "(.+)::(.+)::(.+)::(.+)",
"output.format.string" = "%1$s %2$s %3$s %4$s")
STORED AS TEXTFILE;
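To see why the delimiter-only pattern yields NULLs while the full-record pattern works, the matching can be reproduced outside Hive. This is only a rough sketch: it assumes RegexSerDe requires the pattern to match the whole line, and it uses Python's `re.fullmatch` as a stand-in for the Java regex engine Hive actually uses. The sample line and the `parse` helper are hypothetical.

```python
import re

# Hypothetical sample line in the userId::movieId::rating::time layout.
line = "1111::Rambo::one::2016-01-04 00:12:06"

def parse(pattern, text):
    """Return the captured groups if the pattern matches the entire line, else None."""
    match = re.fullmatch(pattern, text)
    return match.groups() if match else None

# The pattern from the question: it describes only the delimiter,
# so it cannot match a whole row and nothing is captured.
delimiter_only = parse("::", line)

# The pattern from this answer: one capture group per column.
full_record = parse("(.+)::(.+)::(.+)::(.+)", line)

print(delimiter_only)
print(full_record)
```

The single colons inside the timestamp do not confuse the pattern, because only the two-character sequence `::` separates the groups.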
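Answer 1: (score: 1)
Just an addition to the good answer above. If the input file uses a multi-character delimiter, you can also use MultiDelimitSerDe.
Suppose you want to load the input file below into a Hive table.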
userId::movieId::rating::time
1111::Rambo::one::2016-01-04 00:12:06
CREATE EXTERNAL TABLE IF NOT EXISTS UDB.movie_rating (
userId VARCHAR(10)
,movieId VARCHAR(20)
,rating VARCHAR(5)
,movietime timestamp
)
comment 'This table will contain movie rating information.'
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES ("field.delim"="::")
LOCATION '/hdfspathlocation/MULTISERDE'
tblproperties ("skip.header.line.count"="1")
;
select * from UDB.movie_rating;
+---------+----------+---------+------------------------+--+
| userid | movieid | rating | movietime |
+---------+----------+---------+------------------------+--+
| 1111 | Rambo | one | 2016-01-04 00:12:06.0 |
+---------+----------+---------+------------------------+--+
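As a rough illustration of what the two properties in this table definition do, the sketch below splits each line on the literal `::` and drops the header line. It is a plain Python approximation, not Hive's implementation; the constant names and the inline sample data are hypothetical.

```python
# Hypothetical file contents: a header line followed by one data row.
raw = """userId::movieId::rating::time
1111::Rambo::one::2016-01-04 00:12:06"""

FIELD_DELIM = "::"      # mirrors "field.delim"="::"
SKIP_HEADER_LINES = 1   # mirrors "skip.header.line.count"="1"

# Drop the header, then split every remaining non-empty line on the delimiter.
rows = [
    line.split(FIELD_DELIM)
    for line in raw.splitlines()[SKIP_HEADER_LINES:]
    if line
]

print(rows)
```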