I have the following text file:
1,"TEST"Data","SAMPLE DATA"
The table structure is as follows:
CREATE TABLE test1( id string, col1 string, col2 string )
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'mylocation/test1'
When I put the file in the relevant HDFS location, the 2nd and 3rd columns are not populated correctly because of the double quote embedded in the value (TEST"Data).
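One option is to update the data file with the escape character "\", but we are not allowed to modify incoming data. How can I load the data correctly and escape these embedded double quotes?
Thanks for the help!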
Answer 0 (score: 2)
You can load it using RegexSerDe.
bash
mkdir test1
# Write the sample rows with a quoted here-document so the quotes stay literal
cat > test1/file.txt <<'EOF'
1,"TEST"Data","SAMPLE DATA"
2,"TEST Data","SAMPLE DATA"
3,"TEST","Data","SAMPLE","DATA"
EOF
hdfs dfs -put test1 /tmp
hive
create external table test1
(
id string
,col1 string
,col2 string
)
row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
with serdeproperties
(
'input.regex' = '^(\\d+?),"(.*)","(.*)"$'
)
location '/tmp/test1'
;
select * from test1
;
+----------+----------------------+-------------+
| test1.id | test1.col1           | test1.col2  |
+----------+----------------------+-------------+
| 1        | TEST"Data            | SAMPLE DATA |
| 2        | TEST Data            | SAMPLE DATA |
| 3        | TEST","Data","SAMPLE | DATA        |
+----------+----------------------+-------------+
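
To see why row 3 ends up with everything up to the last `","` in col1, it can help to try the pattern outside Hive. The following is only a rough sketch in Python, assuming that Python's `re` module treats this particular pattern the same way as the Java regex engine that Hive uses. The helper name `split_row` is hypothetical and not part of Hive. Note that `\\d` in the HiveQL string literal corresponds to `\d` in the actual regex.

```python
import re

# Same pattern as 'input.regex' above, with the HiveQL string escaping (\\d) undone.
# Assumption: Python's re and Java's regex agree on this simple pattern.
PATTERN = re.compile(r'^(\d+?),"(.*)","(.*)"$')


def split_row(line):
    """Return the three captured columns, or None when the line does not match."""
    match = PATTERN.match(line)
    return list(match.groups()) if match else None


rows = [
    '1,"TEST"Data","SAMPLE DATA"',
    '2,"TEST Data","SAMPLE DATA"',
    '3,"TEST","Data","SAMPLE","DATA"',
]

for row in rows:
    print(split_row(row))
```

Because the first `(.*)` is greedy, it extends to the last `","` on the line. That is what keeps the embedded quote in row 1 inside col1, and it is also why a line with more than three fields (row 3) is folded into col1 rather than rejected.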