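I am facing an issue with SerDe quote removal.
I have a table, TRACKER. I need to strip the double quotes from all columns except the one that contains JSON (PRODUCT). When I load data from the CSV file, the quotes are removed from the JSON data as well.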
CREATE EXTERNAL TABLE IF NOT EXISTS TRACKER
(
SUBSCRIBER STRING,
SERIAL STRING,
PRODUCT STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "\"",
"escapeChar" = "\\"
) STORED AS TEXTFILE
LOCATION '/user/tracker'
tblproperties ("skip.header.line.count"="1");
Sample data in the CSV:
"Raj","400000",{"newData":"d0","olddata":"test1"}
"Rai","400332",{"newData":"data1","olddata":"test2"}
"Ram","444000",{"newData":"New Data","olddata":"test3"}
This works for the first two columns, SUBSCRIBER and SERIAL, but for the last field, PRODUCT, it also removes the quotes from the JSON.
Answer 0 (score: 2)
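RegexSerDe

OpenCSVSerde strips the quote character from every field, so it cannot spare the JSON column. With RegexSerDe, the first two capture groups pick up the values between the quotes, and the third group takes the rest of the line unchanged.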
create external table if not exists tracker
(
subscriber string
,serial string
,product string
)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties ('input.regex' = '"(.*?)","(.*?)",(.*)')
tblproperties ("skip.header.line.count"="1")
;
select * from tracker
;
+--------------------+----------------+---------------------------------------+
| tracker.subscriber | tracker.serial | tracker.product                       |
+--------------------+----------------+---------------------------------------+
| Raj                | 400000         | {"newData":"d0","olddata":"test1"}    |
| Rai                | 400332         | {"newData":"data1","olddata":"test2"} |
+--------------------+----------------+---------------------------------------+
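
To see why the pattern keeps the JSON quotes, it can help to try it outside Hive. The following is only a rough sketch in Python, not Hive itself: it assumes RegexSerDe requires the whole line to match (hence `re.fullmatch`), and `split_line` is a hypothetical helper name made up for this illustration.

```python
import json
import re

# Same pattern as the 'input.regex' SerDe property in the DDL above.
INPUT_REGEX = r'"(.*?)","(.*?)",(.*)'


def split_line(line):
    """Return (subscriber, serial, product), or None if the line does not match.

    Hypothetical helper for illustration only; it is not part of Hive.
    """
    # fullmatch: the pattern has to cover the entire line (assumed to
    # mirror how RegexSerDe applies the regex to each row).
    match = re.fullmatch(INPUT_REGEX, line)
    return match.groups() if match else None


# The three sample rows from the question.
sample_lines = [
    '"Raj","400000",{"newData":"d0","olddata":"test1"}',
    '"Rai","400332",{"newData":"data1","olddata":"test2"}',
    '"Ram","444000",{"newData":"New Data","olddata":"test3"}',
]

for line in sample_lines:
    subscriber, serial, product = split_line(line)
    # The third group is still valid JSON because its quotes were not touched.
    print(subscriber, serial, json.loads(product))
```

The two lazy groups `(.*?)` stop at the first `","` and `",` they can, so the quotes around the first two fields fall outside the captured text, while the greedy `(.*)` takes everything that remains. A line that does not follow the `"...","...",...` shape does not match at all; in this sketch `split_line` returns `None` for it.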