How to skip a column when using a SerDe to remove quotes from columns in Hive

Time: 2017-09-04 13:52:41

Tags: hadoop hive

I am facing a problem with quote removal in a SerDe.

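I have a table named TRACKER. I need to remove the double quotes from every column except the one that contains JSON (PRODUCT). When I load the data from a CSV file, the quotes are stripped from the JSON data as well.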

CREATE EXTERNAL TABLE IF NOT EXISTS TRACKER
(
SUBSCRIBER STRING,
SERIAL STRING,
PRODUCT STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
   "separatorChar" = ",",
   "quoteChar"     = "\"",
   "escapeChar"    = "\\"
)  STORED AS TEXTFILE
LOCATION '/user/tracker'
tblproperties ("skip.header.line.count"="1");

Sample data in the CSV file:

"Raj","400000",{"newData":"d0","olddata":"test1"}
"Rai","400332",{"newData":"data1","olddata":"test2"}
"Ram","444000",{"newData":"New Data","olddata":"test3"}

This works for the first two columns, SUBSCRIBER and SERIAL, but for the last field, PRODUCT, it also removes the quotes from the JSON.

1 answer:

Answer 0: (score: 2)

RegexSerDe

create external table if not exists tracker
(
    subscriber  string
   ,serial      string
   ,product     string
)
    row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
    with serdeproperties ('input.regex' = '"(.*?)","(.*?)",(.*)')
    tblproperties ("skip.header.line.count"="1")
;
select * from tracker
;
+--------------------+----------------+---------------------------------------+
| tracker.subscriber | tracker.serial |            tracker.product            |
+--------------------+----------------+---------------------------------------+
| Raj                |         400000 | {"newData":"d0","olddata":"test1"}    |
| Rai                |         400332 | {"newData":"data1","olddata":"test2"} |
+--------------------+----------------+---------------------------------------+
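To see why this pattern leaves the JSON untouched, here is a minimal sketch that applies the same regular expression to the sample rows outside Hive. It is written in Python purely for illustration; Hive's RegexSerDe uses Java regular expressions, and the assumption here is that this simple pattern behaves the same in both engines and that a row must match in full. The helper name `split_row` is hypothetical.

```python
import json
import re

# Same pattern as the 'input.regex' SerDe property in the answer.
INPUT_REGEX = r'"(.*?)","(.*?)",(.*)'

# Sample rows copied from the question's CSV data.
rows = [
    '"Raj","400000",{"newData":"d0","olddata":"test1"}',
    '"Rai","400332",{"newData":"data1","olddata":"test2"}',
    '"Ram","444000",{"newData":"New Data","olddata":"test3"}',
]

def split_row(line):
    """Return (subscriber, serial, product), or None if the line does not match."""
    match = re.fullmatch(INPUT_REGEX, line)
    return match.groups() if match else None

parsed = [split_row(row) for row in rows]
for fields in parsed:
    print(fields)

# The third group is still valid JSON because its quotes were never stripped.
product = json.loads(parsed[0][2])
print(product["newData"])
```

The first two groups sit between literal quote characters, so only the text inside the quotes is captured. The last group, `(.*)`, takes the rest of the line verbatim, which is why the quotes inside the JSON survive. A line that does not fit the pattern (for example, one with an unquoted first field) produces no match at all in this sketch.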