How to use multiple delimiters in Hive

Time: 2015-02-01 07:35:47

Tags: hive delimiter

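I have an input dataset like this:

"UserID"|"State","City","Country"|"Area Code"

"203448"|"aylesbury, n/a, united kingdom"|\N

Here both , and | act as delimiters.

How can I use both delimiters when creating a table in Hive?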

1 answer:

Answer 0: (score: 1)

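I suggest ingesting each line of the input file whole into a staging table with a single string column, then splitting each line with a regular expression that matches on both commas and pipes. For example: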

DROP TABLE IF EXISTS staging;
CREATE TABLE staging (rawdata STRING);
LOAD DATA LOCAL INPATH 'test.data' INTO TABLE staging;
-- I put your data into a local file called "test.data" - change your path accordingly

With your data, the staging table now looks like this:

hive> SELECT * FROM staging;
OK
"UserID"|"State","City","Country"|"Area Code"
"203448"|"aylesbury, n/a, united kingdom"|\N
Time taken: 0.452 seconds, Fetched: 2 row(s)

Then you can create your final table (I arbitrarily named it "target"; substitute your own name):

DROP TABLE IF EXISTS target;
CREATE TABLE target AS SELECT
  i[0] AS columnNameA,
  i[1] AS columnNameB,
  i[2] AS columnNameC,
  i[3] AS columnNameD,
  i[4] AS columnNameE
FROM (SELECT split(rawdata, ",|\\|") AS i FROM staging) t;
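
As a quick sanity check on the split pattern, a query along these lines could be run against the staging table before creating the final table. It is only a sketch that assumes the `staging` table and `rawdata` column defined earlier; the `field_count` alias is a name made up for illustration.

```sql
-- Sketch: count how many fields each raw line splits into.
-- Uses the same pattern as the CREATE TABLE statement: a comma OR a literal pipe.
SELECT
  rawdata,
  size(split(rawdata, ",|\\|")) AS field_count  -- hypothetical alias
FROM staging;
```

Both sample rows contain four delimiters, so each should report 5 fields. A row with fewer fields would leave some of `i[0]` through `i[4]` NULL, and a row with more would have its extra values ignored, so a mismatch is worth catching early. The character class `[,|]` is an equivalent way to write the same pattern.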

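In the CREATE TABLE statement, replace the column names with the headings you want. Either way, here is the content of the target table after creation (I piped the output through sed so that fields are separated by :: rather than tabs, which I find hard to read):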

# hive -e "select * from target" 2>/dev/null | sed 's/\t/ :: /g'
"UserID" :: "State" :: "City" :: "Country" :: "Area Code"
"203448" :: "aylesbury ::  n/a ::  united kingdom" :: NULL