How to load CSVs with different delimiters into a single Hadoop table

Time: 2017-06-02 07:42:38

Tags: csv hadoop hive

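I want to populate a Hive table from multiple CSV files. The problem is that the files do not all use the same delimiter. When creating the table I can specify only one delimiter, for example ~: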

create table status (type string, ...) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties ("separatorChar" = "~")
STORED AS TEXTFILE

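Does Hive have a built-in feature that allows multiple CSV delimiters? I know the files could be normalized by a Hadoop job before loading, or, based on https://stackoverflow.com/a/26356592/2207078, I could use Pig to do it, but I am looking for something built in. Ideally, I would create the status table without specifying a delimiter and tell Hive how to split the columns at LOAD time.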

1 answer:

Answer 0 (score: 1)

Demo

Data files

comma.txt

|Now|,I've,heard,there,was
a,secret,chord;,That,David 
played,||and||,it,,pleased
the,,,Lord;,

semicolon.txt

But;;you;don't;really 
|care|;for;music;do;||||| you |||||?

pipeline.txt

,It,|,goes,|,like,|,this,|,the, 
fourth|the|fifth|The|;minor
fall|the|;major|lift|The
baffled|king||composing|hallelujah

DDL

create external table mytable 
(c1 string,c2 string,c3 string,c4 string,c5 string)
partitioned by (delim string)
;

alter table mytable set serdeproperties ('field.delim'=',');
alter table mytable add partition (delim='comma');  

alter table mytable set serdeproperties ('field.delim'=';');
alter table mytable add partition (delim='semicolon');

alter table mytable set serdeproperties ('field.delim'='|');
alter table mytable add partition (delim='pipeline');
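
The order of the statements above matters. Hive keeps SerDe properties per partition, and a newly added partition takes the table-level properties in effect at that moment, so each `set serdeproperties` only affects partitions added after it. The following is a toy Python model of that behavior, not Hive's API; the class `ToyTable` and its method names are hypothetical and exist only for illustration.

```python
# Toy model (NOT Hive's API): a partition snapshots the table's SerDe
# properties at the moment it is added.
class ToyTable:
    def __init__(self):
        self.serde_properties = {}   # table-level properties
        self.partitions = {}         # partition name -> its own properties

    def set_serde_properties(self, props):
        # Changes the table-level properties only; existing partitions keep theirs.
        self.serde_properties.update(props)

    def add_partition(self, name):
        # Copy, so later table-level changes do not leak into this partition.
        self.partitions[name] = dict(self.serde_properties)


mytable = ToyTable()
for name, delim in [("comma", ","), ("semicolon", ";"), ("pipeline", "|")]:
    mytable.set_serde_properties({"field.delim": delim})
    mytable.add_partition(name)

for name, props in mytable.partitions.items():
    print(name, repr(props["field.delim"]))
```

In this model the table-level delimiter is left at `|` after the last statement, so a partition added later without another `set serdeproperties` would also be pipe-delimited.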

Put the files in the matching directories

mytable
├── delim=comma
│   └── comma.txt
├── delim=pipeline
│   └── pipeline.txt
└── delim=semicolon
    └── semicolon.txt

select * from mytable
;
+---------+---------+--------+-----------+------------------+-----------+
|   c1    |   c2    |   c3   |    c4     |        c5        |   delim   |
+---------+---------+--------+-----------+------------------+-----------+
| |Now|   | I've    | heard  | there     | was              | comma     |
| a       | secret  | chord; | That      | David            | comma     |
| played  | ||and|| | it     |           | pleased          | comma     |
| the     |         |        | Lord;     |                  | comma     |
| But     |         | you    | don't     | really           | semicolon |
| |care|  | for     | music  | do        | ||||| you |||||? | semicolon |
| ,It,    | ,goes,  | ,like, | ,this,    | ,the,            | pipeline  |
| fourth  | the     | fifth  | The       | ;minor           | pipeline  |
| fall    | the     | ;major | lift      | The              | pipeline  |
| baffled | king    |        | composing | hallelujah       | pipeline  |
+---------+---------+--------+-----------+------------------+-----------+
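
To make the result easier to check, here is a minimal Python sketch that mimics the splitting for a few of the sample lines. It is not how Hive is implemented: `split_row` is a hypothetical helper, and the handling of short and long rows (pad with `None`, drop extra fields) is an assumption about the default text SerDe rather than something shown in the output above.

```python
NUM_COLUMNS = 5  # c1..c5 in the DDL


def split_row(line, delim, num_columns=NUM_COLUMNS):
    """Split one text line into exactly num_columns fields."""
    fields = line.split(delim)[:num_columns]        # assumption: extra fields are dropped
    fields += [None] * (num_columns - len(fields))  # assumption: missing fields become NULL
    return fields


# partition name -> (delimiter, sample lines taken from the data files above)
partitions = {
    "comma": (",", ["played,||and||,it,,pleased"]),
    "semicolon": (";", ["|care|;for;music;do;||||| you |||||?"]),
    "pipeline": ("|", ["fall|the|;major|lift|The",
                       "baffled|king||composing|hallelujah"]),
}

rows = []
for name, (delim, lines) in partitions.items():
    for line in lines:
        # The partition value is appended as the last column, as in the query output.
        rows.append(split_row(line, delim) + [name])

for row in rows:
    print(row)
```

Each line is split only on its own partition's delimiter, which is why `;major` and `||and||` survive intact as field values.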