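I am trying to create a table in Impala from a CSV file that I uploaded to an HDFS directory. The CSV contains values that include commas and are enclosed in double quotes.
Example: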
1.66.96.0/19,"NTT Docomo,INC.","Ntt Docomo",9605,"NTT DOCOMO, INC."
1.66.128.0/17,"NTT Docomo,INC.","Ntt Docomo",9605,"NTT DOCOMO, INC."
1.67.0.0/17,"NTT Docomo,INC.","Ntt Docomo",9605,"NTT DOCOMO, INC."
1.67.128.0/18,"NTT Docomo,INC.","Ntt Docomo",9605,"NTT DOCOMO, INC."
1.67.192.0/19,"NTT Docomo,INC.","Ntt Docomo",9605,"NTT DOCOMO, INC."
The Impala documentation says that the ESCAPED BY clause can be used to solve this. Here is my current code:
DROP TABLE IF EXISTS GeoIP2_ISP_Blocks_IPv4;
CREATE TABLE GeoIP2_ISP_Blocks_IPv4 (
network STRING
,isp STRING
,organization STRING
,autonomous_system_number STRING
,autonomous_system_organization STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\'
LOCATION 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/';
INVALIDATE METADATA GeoIP2_ISP_Blocks_IPv4;
LOAD DATA INPATH 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/'
INTO TABLE GeoIP2_ISP_Blocks_IPv4;
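I have also tried the ESCAPED BY '"' clause. In both cases, Impala treats the comma inside the quotes as a delimiter and splits the value across two columns.
Any ideas on how to fix the code so that this does not happen?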
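To make the symptom concrete, here is a minimal Python sketch of the difference between delimiter-only splitting and quote-aware parsing, applied to one sample row from above. It is only an illustration of the behavior described in the question, not Impala's actual parser; the variable names are made up for this example.

```python
import csv

# One sample row from the question.
line = '1.66.96.0/19,"NTT Docomo,INC.","Ntt Docomo",9605,"NTT DOCOMO, INC."'

# Delimiter-only splitting: every comma is a field boundary and the
# double quotes are treated as ordinary characters.
naive_fields = line.split(",")

# Quote-aware parsing: commas inside double quotes stay in the field.
quoted_fields = next(csv.reader([line]))

print(len(naive_fields), naive_fields)
print(len(quoted_fields), quoted_fields)
```

The first split yields more fields than the table has columns, which matches the values landing in the wrong columns; the quote-aware parse yields the five intended fields.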
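Edit (June 9, 2015)
Following the suggestions from @K S Nidhin and @JTUP, I tried the variants below. However, each one returns the same result as the query without the SERDEPROPERTIES clause: the commas still cause values to end up in the wrong columns.
Variant 1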
DROP TABLE IF EXISTS GeoIP2_ISP_Blocks_IPv4;
CREATE TABLE GeoIP2_ISP_Blocks_IPv4 (
network STRING
,isp STRING
,organization STRING
,autonomous_system_number STRING
,autonomous_system_organization STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
WITH SERDEPROPERTIES ( "quoteChar" = "'", "escapeChar" = "\\" )
LOCATION 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/';
INVALIDATE METADATA GeoIP2_ISP_Blocks_IPv4;
LOAD DATA INPATH 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/'
INTO TABLE GeoIP2_ISP_Blocks_IPv4;
Variant 2
DROP TABLE IF EXISTS GeoIP2_ISP_Blocks_IPv4;
CREATE TABLE GeoIP2_ISP_Blocks_IPv4 (
network STRING
,isp STRING
,organization STRING
,autonomous_system_number STRING
,autonomous_system_organization STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\'
WITH SERDEPROPERTIES ( 'quoteChar' = '"', 'escapeChar' = '\\' )
LOCATION 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/';
INVALIDATE METADATA GeoIP2_ISP_Blocks_IPv4;
LOAD DATA INPATH 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/'
INTO TABLE GeoIP2_ISP_Blocks_IPv4;
Variant 3
DROP TABLE IF EXISTS GeoIP2_ISP_Blocks_IPv4;
CREATE TABLE GeoIP2_ISP_Blocks_IPv4 (
network STRING
,isp STRING
,organization STRING
,autonomous_system_number STRING
,autonomous_system_organization STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\'
WITH SERDEPROPERTIES (
"separatorChar" = "\,",
"quoteChar" = "\""
)
LOCATION 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/';
INVALIDATE METADATA GeoIP2_ISP_Blocks_IPv4;
LOAD DATA INPATH 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/'
INTO TABLE GeoIP2_ISP_Blocks_IPv4;
Any other ideas, or other variants of the SERDEPROPERTIES clause to try?
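Edit (October 6, 2016)
I was able to get a different variant of the query, using the SERDE and SERDEPROPERTIES clauses, to work in Hive (based on the code provided in the Hive documentation), and the table was created correctly: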
DROP TABLE IF EXISTS GeoIP2_ISP_Blocks_IPv4;
CREATE TABLE GeoIP2_ISP_Blocks_IPv4(network STRING
,isp STRING
,organization STRING
,autonomous_system_number STRING
,autonomous_system_organization STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'separatorChar' = ',',
'quoteChar' = '"',
'escapeChar' = '\\'
)
STORED AS TEXTFILE;
LOAD DATA INPATH 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/'
INTO TABLE GeoIP2_ISP_Blocks_IPv4;
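Because the SERDE clause is not available in Impala, this solution does not work there. I can create the table in Hive, but I still have not found a working solution in Impala.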
Answer 0 (score: 0)
DROP TABLE IF EXISTS GeoIP2_ISP_Blocks_IPv4;
CREATE TABLE GeoIP2_ISP_Blocks_IPv4 (
network STRING
,isp STRING
,organization STRING
,autonomous_system_number STRING
,autonomous_system_organization STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ESCAPED BY '\\'
WITH SERDEPROPERTIES (
"separatorChar" = "\,",
"quoteChar" = "\""
)
LOCATION 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/';
INVALIDATE METADATA GeoIP2_ISP_Blocks_IPv4;
LOAD DATA INPATH 'hdfs://.../GeoIP2_ISP_Blocks_IPv4/'
INTO TABLE GeoIP2_ISP_Blocks_IPv4;
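Add SERDEPROPERTIES; hopefully that will do the trick.
Answer 1 (score: 0)
What I would do is first convert the delimiter from a comma to some other character, such as a pipe ('|'). On Linux you can use csvformat (part of csvkit):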
csvformat -D \| input_filename.csv > input_filename-pipe.csv
Then set the delimiter to '|' in the Impala query:
TERMINATED BY '|'
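If csvkit is not available, the same conversion can be sketched with Python's standard csv module. This is a minimal sketch under a few assumptions: the function name to_pipe_delimited is made up for this example, the output is written without any quoting, and a backslash is used as the escape character so that a literal '|' inside a value would be written as '\|' (intended to pair with ESCAPED BY '\\' in the table definition).

```python
import csv
import io

def to_pipe_delimited(src, dst):
    """Re-write quoted, comma-separated rows as unquoted, pipe-separated rows.

    src and dst are text file objects. The function name and the choice of
    backslash as escape character are assumptions made for this sketch.
    """
    reader = csv.reader(src)  # understands "..." quoting around commas
    writer = csv.writer(
        dst,
        delimiter="|",
        quoting=csv.QUOTE_NONE,  # never wrap output fields in quotes
        escapechar="\\",         # prefix a literal '|' in a value with '\'
        lineterminator="\n",
    )
    for row in reader:
        writer.writerow(row)

# Small demonstration using one sample row from the question.
sample = '1.66.96.0/19,"NTT Docomo,INC.","Ntt Docomo",9605,"NTT DOCOMO, INC."\n'
out = io.StringIO()
to_pipe_delimited(io.StringIO(sample), out)
print(out.getvalue(), end="")
```

For real files, open them with newline="" as the csv module documentation recommends, for example open("input_filename.csv", newline="") for reading and open("input_filename-pipe.csv", "w", newline="") for writing.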