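How do I load data into a Hive table when the delimiter itself appears in the data?

Time: 2018-05-04 02:15:17

Tags: hive

I am loading data into a Hive table, and the data itself contains commas.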

Input file: emp.csv

101,deepak,kumar,das
102,sumita,kumari,das
103,rajesh kumar das

Expected output:
id  name
101 deepak kumar das
102 sumita kumari das
103 rajesh kumar das

When I create the Hive table below and load the data, the result is wrong; everything after the second comma is dropped:

create external table hive_test(
id int, name string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/hive_demo';

load data local inpath '/home/cloudera/hadoop/hive_demo/emp.csv' overwrite into table hive_test;

hive> select * from hive_test;
101 deepak
102 sumita
103 rajesh kumar das

So I created the following table instead, but querying it fails with an error.

create external table hive_test1(
id int,
name string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES(
"separatorChar" = ",",
"quoteChar" = "'",
"escapeChar" = "\,")
STORED AS TEXTFILE
LOCATION '/hive_demo';
load data local inpath '/home/cloudera/hadoop/hive_demo/emp.csv' overwrite into table hive_test1;

select * from hive_test1;
Failed with exception 
java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: 
java.lang.UnsupportedOperationException: The separator, quote, and escape characters must be different!

How can I load this data into the Hive table?

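1 answer:

Answer 0: (score: 0)

The solution below is based on the following assumptions:

  • You only ever need to extract 2 columns from the CSV.
  • The first column is numeric, and the second column runs from just after the first ',' character to the end of the line.
  • Any ',' characters in the name column should be replaced with spaces.

Define the table with RegexSerDe and load the data: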

create external table hive_test(
id int, name string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
    -- 2 regex groups as per the assumptions; the backslash is doubled
    -- because Hive string literals treat '\' as an escape character
    "input.regex" = "^(\\d+),(.*)$"
)
STORED AS TEXTFILE
LOCATION '/path/to/table';
LOAD data local inpath '/path/to/local/csv' overwrite into table hive_test;
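
It may be worth checking the load before moving on. As far as I know, RegexSerDe returns NULL for every column of a line that does not match input.regex, so a quick count shows whether the pattern skipped any lines. A minimal sketch, assuming the hive_test table defined above (the unmatched_lines alias is just an illustrative name):

```sql
-- Lines that did not match "input.regex" come back with all columns NULL.
select count(*) as unmatched_lines
from hive_test
where id is null and name is null;
```

A non-zero count would suggest that some lines do not start with digits followed by a comma, which breaks the second assumption.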

Replace the ',' characters in the name column with spaces:

create table hive_test1 as 
select id, regexp_replace(name, ',', ' ') as name
from hive_test;

Then select * from hive_test1 returns the following:

101 deepak kumar das
102 sumita kumari das
103 rajesh kumar das
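
If you would rather not use a custom SerDe, the same assumptions can probably be met with built-in string functions only. This is a sketch, not something tested against your cluster: the names hive_raw and hive_test2 and the path '/path/to/raw' are hypothetical, and it assumes the file contains no '\001' (Ctrl-A) characters, since that is Hive's default field delimiter when no ROW FORMAT is given.

```sql
-- Staging table: each line of the CSV lands whole in a single string column.
create external table hive_raw(line string)
STORED AS TEXTFILE
LOCATION '/path/to/raw';

load data local inpath '/path/to/local/csv' overwrite into table hive_raw;

-- Group 1 is the numeric id; group 2 is everything after the first comma.
-- The backslash is doubled because Hive string literals treat '\' as an escape.
create table hive_test2 as
select cast(regexp_extract(line, '^(\\d+),(.*)$', 1) as int) as id,
       regexp_replace(regexp_extract(line, '^(\\d+),(.*)$', 2), ',', ' ') as name
from hive_raw;
```

The trade-off is an extra staging table, but the parsing logic sits in an ordinary query where it is easy to adjust and inspect.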