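Change the delimiter and generate output with Pig

Time: 2016-02-11 13:12:58

Tags: apache-pig

I want to generate the following output from the given input. What is the best way to do this?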



Input:-
"column,1A,extra-A1,extra-A2",column2A,column3A
"((column,1B,extra-B1))",column2B,column3B
"column,1C,extra-C1,extra-C2,extra-C3,extra-C4",column2C,column3C
"column,1D,extra-D1",column2D,column3D


Output:-
column,1A,extra-A1,extra-A2|column2A|column3A
((column,1B,extra-B1))|column2B|column3B
column,1C,extra-C1,extra-C2,extra-C3,extra-C4|column2C|column3C
column,1D,extra-D1|column2D|column3D




3 answers:

Answer 0 (score: 1)

I was able to solve it with the approach below. If you have a better option, please let me know.



Input:-
"column,1A,extra-A1,extra-A2",column2A,column3A
"((column,1B,extra-B1))",column2B,column3B
"column,1C,extra-C1,extra-C2,extra-C3,extra-C4",column2C,column3C
"column,1D,extra-D1",column2D,column3D

Pig Script:-
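-- Load each input line as a single chararray field
A = LOAD '/home/hduser/pig_ex1/sample1.txt' AS (line:chararray);
-- firstcol: text between the quotes; lastcol: everything after the comma that follows the closing quote
B = FOREACH A GENERATE SUBSTRING(line,1,(LAST_INDEX_OF(line,'"'))) AS firstcol, SUBSTRING(line,(LAST_INDEX_OF(line,'"')+2),(INT) SIZE(line)) AS lastcol;
-- Split the remainder on the first comma into the second and third columns
C = FOREACH B GENERATE firstcol, FLATTEN(STRSPLIT(lastcol,'\\,',2)) AS (secondcol,thirdcol);
-- Join the three columns with the pipe delimiter
D = FOREACH C GENERATE CONCAT(firstcol,'|',secondcol,'|',thirdcol);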

Output:-
(column,1A,extra-A1,extra-A2|column2A|column3A)
(((column,1B,extra-B1))|column2B|column3B)
(column,1C,extra-C1,extra-C2,extra-C3,extra-C4|column2C|column3C)
(column,1D,extra-D1|column2D|column3D)




Answer 1 (score: 0)

Try org.apache.pig.piggybank.storage.CSVExcelStorage(','); from piggybank.
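
A minimal sketch of how that suggestion might look in practice. The answer does not give a script, so the piggybank.jar location, the field names col1 to col3, and the output directory below are all placeholders chosen for illustration. CSVExcelStorage treats a double-quoted field as a single value even when it contains commas, and PigStorage('|') writes the fields back out separated by pipes.

```pig
-- Placeholder path: point this at the piggybank.jar shipped with your Pig installation
REGISTER /path/to/piggybank.jar;

-- Read the file as CSV; the quoted first field is kept as one value, commas included
A = LOAD '/home/hduser/pig_ex1/sample1.txt'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage(',')
    AS (col1:chararray, col2:chararray, col3:chararray);

-- Write the three fields with '|' as the delimiter (output directory is a placeholder)
STORE A INTO '/home/hduser/pig_ex1/output' USING PigStorage('|');
```

With this approach no string slicing or CONCAT is needed, since the storage functions handle both the quoting on input and the delimiter on output.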

Answer 2 (score: 0)

I am using a regular expression.

Try this code:

a = LOAD '/home/hduser/pig_ex1/sample1.txt' as line;
b = FOREACH a GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'["](.*)["][,](.*)[,](.*)'))  AS (f1,f2,f3);

c = FOREACH b GENERATE CONCAT(f1,'|',f2,'|',f3);

dump c;