Question

I have a CSV file like this:

""9998"";""714144"";""frwiki-20131107-pages-meta-history2.xml"";""Ripchip Bot""
""10000"";""195090"";""frwiki-20131107-pages-meta-history2.xml"";""TXiKiBoT""
""10002"";""265154"";""frwiki-20131107-pages-meta-history2.xml"";""Jimmy44""

I tried to create an external table on top of it:

CREATE EXTERNAL TABLE titi(username string,id int, revisionid int, fileName string) row format serde 'com.bizo.hive.serde.csv.CSVSerde'
    with serdeproperties("separatorChar" = "\;"
    , "quoteChar" = "\"\"")
    stored as textfile
    LOCATION '/contributor';

But this is the result I get:

hive> select * from titi limit 10;
OK
��"9998""       ""714144""      ""frwiki-20131107-pages-meta-history2.xml""     ""Ripchip Bot""
        NULL    NULL    NULL
"10000""        ""195090""      ""frwiki-20131107-pages-meta-history2.xml""     ""TXiKiBoT""
        NULL    NULL    NULL
"10002""        ""265154""      ""frwiki-20131107-pages-meta-history2.xml""     ""Jimmy44""
        NULL    NULL    NULL
"10004""        """"    ""frwiki-20131107-pages-meta-history2.xml""     """"
        NULL    NULL    NULL
"10006""        ""1046395""     ""frwiki-20131107-pages-meta-history2.xml""     ""LoveBot""
        NULL    NULL    NULL

Is my table creation syntax wrong?

Answer 1

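I reproduced your problem and can confirm it.

As far as I can tell, though, the SerDe is working as intended and cannot help you in this case. Because quoteChar accepts a single character rather than a string, it manages to strip one double quote but not the second.

If it accepted a string instead of a char as its parameter, you could use it to strip the doubled double quotes.

I think you will have to either load the file with a Regex SerDe (see an example here), or load it as-is and do the cleaning afterwards in Hive.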
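The second option (load as-is, then clean in Hive) could look roughly like the sketch below. It is only an illustration under assumptions: the staging table name titi_raw is hypothetical, the column names and their order are guessed from the sample rows, and it relies on no value containing a semicolon or a double quote.

```sql
-- Hypothetical staging table: each raw line lands in a single STRING column,
-- because the default field delimiter (Ctrl-A) does not occur in the file.
CREATE EXTERNAL TABLE titi_raw (line STRING)
STORED AS TEXTFILE
LOCATION '/contributor';

-- Split each line on the separator, then strip every double quote from each piece.
-- Note: older Hive CLI versions may need the semicolon inside the literal escaped,
-- as the question does with "\;".
SELECT
  CAST(regexp_replace(parts[0], '"', '') AS INT) AS id,
  CAST(regexp_replace(parts[1], '"', '') AS INT) AS revisionid,
  regexp_replace(parts[2], '"', '')              AS filename,
  regexp_replace(parts[3], '"', '')              AS username
FROM (
  SELECT split(line, ';') AS parts
  FROM titi_raw
) t;
```

The cleaned result can then be written to a regular table with CREATE TABLE ... AS SELECT if it needs to be queried repeatedly.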

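Edit: I just opened a ticket for this issue on GitHub.

Edit 2: Here is a solution using the Regex SerDe. It is not the most beautiful thing you will see today, but it works (as long as your strings contain no double quotes):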

CREATE TABLE titi (
  field1 STRING,
  field2 STRING,
  field3 STRING,
  field4 STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES  (
"input.regex" = "\"\"([^\"]*)\"\"\;\"\"([^\"]*)\"\"\;\"\"([^\"]*)\"\"\;\"\"([^\"]*)\"\"",
"output.format.string" = "%1$s %2$s %3$s %4$s"
)
STORED AS TEXTFILE;

This uses the following regex (shown here without the escape characters): ""([^"]*)"";""([^"]*)"";""([^"]*)"";""([^"]*)""
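The contrib RegexSerDe accepts only STRING columns, which is why all four fields are declared as STRING above; numeric values have to be cast at query time. A minimal sketch, assuming the data has already been loaded into titi and that the fields follow the order of the sample rows (the aliases are illustrative, not part of the original solution):

```sql
-- Assumption: field1/field2 are the numeric IDs, field3 the dump file name,
-- field4 the user name (column order taken from the sample rows).
SELECT
  CAST(field1 AS INT) AS id,
  CAST(field2 AS INT) AS revisionid,
  field3              AS filename,
  field4              AS username
FROM titi
LIMIT 10;
```

Rows with an empty value, such as the one with id 10004 in the question's output, still match the regex because [^"]* also matches the empty string; the cast of an empty string then yields NULL.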

Answer 2

I ran into a similar problem on Hive 3.x with the CSV SerDe. The following solved it for me (example):

0   2021-01-01 13:00:00
1   2021-01-01 14:00:00
2   2021-01-01 17:30:00
Name: Time, dtype: datetime64[ns]

To be precise, the quoteChar value should contain a backslash, as shown above.
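As a rough sketch of what a backslash-escaped quoteChar looks like in a table definition, assuming the OpenCSVSerde bundled with Hive; the table name, columns, and location are hypothetical, and this is not necessarily the exact definition used in this answer:

```sql
-- Hypothetical table; OpenCSVSerde reads every column as STRING,
-- so cast to other types in the queries that need them.
CREATE EXTERNAL TABLE events_csv (
  id         STRING,
  event_time STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = "\;",
  "quoteChar"     = "\""   -- a single double quote, escaped with a backslash
)
STORED AS TEXTFILE
LOCATION '/data/events';
```

Note that quoteChar is still a single character here, so this addresses ordinary quoted fields rather than the doubled quotes in the question's file.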

HIVE - quoteChar serde not working
