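Loading data with a multi-character delimiter in Apache Pig

Time: 2016-03-25 12:21:26

Tags: apache-pig delimiter

Hi everyone, I have a question about loading data with Apache Pig. The file format looks like this: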

"1","2","xx,yy","a,sd","3"

So I want to load it using the multi-character delimiter "," (a double quote, a comma, and another double quote):

A = LOAD 'file.csv' USING PigStorage('","') AS (f1,f2,f3,f4,f5);

But PigStorage does not accept the multi-character delimiter ",". How can I do this? Thank you very much!

1 answer:

Answer 0 (score: 0)

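PigStorage takes a single character as its delimiter. You can use the CSVLoader load function from PiggyBank instead. Download piggybank.jar, save it in the same folder as your Pig script, and register the jar in the script.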

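-- Register the PiggyBank jar (assumed to sit next to the script)
REGISTER piggybank.jar;

DEFINE CSVLoader org.apache.pig.piggybank.storage.CSVLoader();

-- CSVLoader splits on commas and keeps commas inside double quotes as data
A = LOAD 'test1.txt' USING CSVLoader() AS (f1:int,f2:int,f3:chararray,f4:chararray,f5:int);
B = FOREACH A GENERATE f1,f2,f3,f4,f5;
DUMP B;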
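
To see why a quote-aware CSV parser suits this input, the sketch below parses the sample row from the question with Python's standard csv module. It only models quote-aware parsing in general; it is not Pig code and does not reproduce CSVLoader's implementation. The function name parse_quoted_row is made up for illustration.

```python
import csv
import io


def parse_quoted_row(line):
    """Parse one CSV line, treating commas inside double quotes as data."""
    reader = csv.reader(io.StringIO(line), delimiter=",", quotechar='"')
    return next(reader)


row = parse_quoted_row('"1","2","xx,yy","a,sd","3"')
print(row)  # ['1', '2', 'xx,yy', 'a,sd', '3']
```

The row comes back as five fields, with the embedded commas in the third and fourth fields preserved.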

An alternative is to load each line as a single string and then split it with STRSPLIT:

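A = LOAD 'test1.txt' USING TextLoader() AS (line:chararray);
-- Strip the opening and closing quote first; otherwise they stay on the first and last fields
B = FOREACH A GENERATE FLATTEN(STRSPLIT(REPLACE(line, '^"|"$', ''), '","'));
DUMP B;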
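
Splitting on the three-character sequence "," by itself leaves the opening quote on the first field and the closing quote on the last. The sketch below models that behaviour with plain Python string operations, and then shows the result when the outer quotes are stripped first. It is not Pig code, and split_on_quote_comma is a hypothetical helper name.

```python
def split_on_quote_comma(line, strip_outer=True):
    """Split a fully quoted row on the literal sequence '","'."""
    # Optionally drop the opening and closing quote before splitting.
    if strip_outer and len(line) >= 2 and line[0] == '"' and line[-1] == '"':
        line = line[1:-1]
    return line.split('","')


sample = '"1","2","xx,yy","a,sd","3"'
print(split_on_quote_comma(sample, strip_outer=False))  # ['"1', '2', 'xx,yy', 'a,sd', '3"']
print(split_on_quote_comma(sample))                     # ['1', '2', 'xx,yy', 'a,sd', '3']
```

This approach relies on every field being quoted; a row that mixes quoted and unquoted fields would not be separated correctly by this sequence.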