How do I remove double quotes when loading a CSV into an external table in Impala?

Time: 2018-12-01 21:39:44

Tags: csv impala

Here is the data (you can also download it from here):

"Creation Date","Status","First 3 Chars of Postal Code","Intersection Street 1","Intersection Street 2","Ward","Service Request Type","Division","Section"
"2010-01-01 00:38:26.0000000","Closed","Intersection","High Park Blvd","Parkside Dr","Parkdale-High Park (13)","Road - Sanding / Salting Required","Transportation Services","Road Operations"
"2010-01-01 01:19:18.0000000","Closed","M4T","","","Toronto Centre-Rosedale (27)","Water Service Line-Turn On","Toronto Water","District Ops"

Here is my create table query:

CREATE TABLE sr.sr2013 ( 
creation_date STRING,   
status STRING,   
first_3_chars_of_postal_code STRING,   
intersection_street_1 STRING,   
intersection_street_2 STRING,   
ward STRING,   
service_request_type STRING,   
division STRING,   
section STRING ) 
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' 
WITH SERDEPROPERTIES (
'colelction.delim'='\u0002', 
'mapkey.delim'='\u0003', 
'serialization.format'=',', 
'field.delim'=',', 
'skip.header.line.count'='1',
'quoteChar'= "\"") ;

Here is the load data query:

load data inpath '/user/rxie/SR2013.csv' into table sr2013;

After loading the data, a look at the table shows that all the original quotes are still there.

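So there are at least two problems here:

  1. The option 'skip.header.line.count'='1' in the table definition does not exclude the header row.
  2. The double quotes are not removed when the data is loaded into the table, despite the option 'quoteChar'= "\"".

Can anyone shed more light on this? It looks like a bug to me.

Update 1:

In the Hue / Hive editor: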

creation_date STRING,   
status STRING,   
first_3_chars_of_postal_code STRING,   
intersection_street_1 STRING,   
intersection_street_2 STRING,   
ward STRING,   
service_request_type STRING,   
division STRING,   
section STRING )                               
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' 
WITH SERDEPROPERTIES (                             
   'colelction.delim'='\u0002',                     
   'field.delim'=',',                               
   'mapkey.delim'='\u0003',                         
   'serialization.format'=',',
   'skip.header.line.count'='1',   
   'quoteChar'= "\"") 


   LOAD DATA LOCAL INPATH '/home/rxie/data/csv/SR2015.csv' INTO TABLE sr2015;  

Error:

  

Error while compiling statement: FAILED: SemanticException Line 1:26 Invalid path ''/home/rxie/data/csv/SR2015.csv'': No files matching path file:/home/rxie/data/csv/SR2015.csv

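1 Answer:

Answer 0: (score: 0)

Here is how I excluded the quotes when loading the CSV:

In the Hive editor (I think beeline would work as well, although I have not tested it):

  1. Create the Hive table: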

    CREATE EXTERNAL TABLE sr2015(
    creation_date STRING,
    status STRING,
    first_3_chars_of_postal_code STRING,
    intersection_street_1 STRING,
    intersection_street_2 STRING,
    ward STRING,
    service_request_type STRING,
    division STRING,
    section STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    WITH SERDEPROPERTIES (
       'colelction.delim'='\u0002',
       'field.delim'=',',
       'mapkey.delim'='\u0003',
       'serialization.format'=',',
       'skip.header.line.count'='1',
       'quoteChar'= "\"")

  2. Load the data into the Hive table:

    LOAD DATA INPATH "hdfs:///user/rxie/SR2015.csv" INTO TABLE sr2015;
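
For comparison, here is a leaner sketch of the same table definition. It is not what I actually ran. It assumes that only the properties documented for OpenCSVSerde (separatorChar, quoteChar, escapeChar) are needed, and that the header row is skipped through TBLPROPERTIES rather than SERDEPROPERTIES. The table name sr2015_csv is hypothetical. Note that OpenCSVSerde reads every column as STRING.

```sql
-- Sketch only: sr2015_csv is a hypothetical table name, not from the original post.
CREATE EXTERNAL TABLE sr2015_csv (
  creation_date STRING,
  status STRING,
  first_3_chars_of_postal_code STRING,
  intersection_street_1 STRING,
  intersection_street_2 STRING,
  ward STRING,
  service_request_type STRING,
  division STRING,
  section STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',   -- field delimiter
  'quoteChar'     = '"',   -- quotes around fields are stripped on read
  'escapeChar'    = '\\'   -- backslash escapes inside quoted fields
)
STORED AS TEXTFILE
TBLPROPERTIES ('skip.header.line.count' = '1');  -- assumed placement for header skipping

-- After loading the CSV into this table, spot-check that the quotes and header are gone.
SELECT creation_date, status, ward FROM sr2015_csv LIMIT 5;
```

If the SELECT still returns the header row or quoted values, the properties are probably not being picked up, and DESCRIBE FORMATTED sr2015_csv should show which ones were stored.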

Outstanding issue (to be discussed here): the table cannot be accessed in Impala.
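
One possible workaround for the Impala issue, offered as an untested sketch: Impala does not support custom SerDes such as OpenCSVSerde, which is a likely reason the table is unreadable there. Under that assumption, the rows can be copied in Hive into a table stored in a format Impala reads natively, for example Parquet. The table name sr2015_parquet is hypothetical.

```sql
-- Run in Hive: let OpenCSVSerde parse the CSV, then store the clean rows as Parquet.
-- sr2015_parquet is a hypothetical name, not from the original post.
CREATE TABLE sr2015_parquet STORED AS PARQUET AS
SELECT * FROM sr2015;

-- Run in impala-shell: make Impala aware of the table created through Hive.
INVALIDATE METADATA sr2015_parquet;

-- Run in impala-shell: the values should now come back without surrounding quotes.
SELECT status, COUNT(*) AS requests
FROM sr2015_parquet
GROUP BY status;
```

The cost is a second copy of the data, so the Parquet table has to be rebuilt or appended to whenever new CSV files are loaded into sr2015.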