Question

I have data in this format.

"123";"mybook1";"2002";"publisher1";
"456";"mybook2;the best seller";"2004";"publisher2";
"789";"mybook3";"2002";"publisher1";

Fields are enclosed in "" and delimited by ;. The book title may also contain ';' inside it.

Can you tell me how to load this data from the file into a Hive table?

The query I am using now, shown below, obviously does not work:

create table books (isbn string,title string,year string,publisher string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\;'

If possible, I would like the userid and year fields to be stored as Int. Please help.

Thanks, Harish

Answer 1

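What you are missing is RegexSerDe. It is very useful when you want to pick up only part of the text from each input line. Your DDL would look like this: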

create table books ( isbn string, title string, year string, publisher string ) 
  ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
  WITH SERDEPROPERTIES  (
     "input.regex" = "(?:\")(\\d*)(?:\"\;\")([^\"]*)(?:\"\;\")(\\d*)(?:\"\;\")([^\"]*)\"(?:\;)" ,
     "output.format.string" = "%1$s %2$s %3$s %4$s"
    )
  STORED AS TEXTFILE;

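Because of the escaping and the non-capturing groups, the regex may look complicated at first glance. In fact it contains just two capturing groups, (\d*) and ([^"]*), placed alternately twice. The non-capturing groups ((?:...)) only help strip the unnecessary context. The group ([^"]*) also takes care of a ';' inside the bookName field.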
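
With the table defined, loading the file is a separate step. A minimal sketch, assuming the data sits in a local file and that the hive-contrib jar providing this SerDe is not already on the classpath. Both paths below are placeholders, not taken from the question:

```sql
-- Make the contrib RegexSerDe class available to the session.
-- Hypothetical path: point it at the hive-contrib jar of your installation.
ADD JAR /path/to/hive-contrib.jar;

-- Copy the local file into the table's storage location.
-- Hypothetical path: replace with the real location of the data file.
LOAD DATA LOCAL INPATH '/path/to/books.txt' INTO TABLE books;

-- Spot-check the parse. Lines that do not match input.regex
-- come back with NULL in every column.
SELECT isbn, title, year, publisher FROM books LIMIT 3;
```

The regex is applied when rows are read, not when the file is loaded, so a mistake in the pattern shows up as NULL columns in the SELECT rather than as a load error.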

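But nothing comes for free. For all its features, RegexSerDe supports only string columns. All you can do is call Hive's built-in cast function to convert the values when selecting from the table. For example (the exact syntax may differ slightly):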

 SELECT cast( year as int ) from books;
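
To avoid repeating the casts in every query, one option is to wrap them in a view. A sketch, assuming the identifier the question wants as an Int is the isbn column. The view name books_typed is made up for illustration:

```sql
-- books_typed is a hypothetical name; the underlying table keeps its string columns.
CREATE VIEW books_typed AS
SELECT
  cast(isbn AS int) AS isbn,   -- yields NULL if the value is not a valid integer
  title,
  cast(year AS int) AS year,
  publisher
FROM books;

-- Queries can then treat year as a number.
SELECT title FROM books_typed WHERE year = 2002;
```

The sample identifiers (123, 456, 789) fit in an int, but real 10 or 13 digit ISBNs would not, so bigint or the original string column may be the safer choice there.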

Hope this helps.

Collecting data from a file using Hive
