I want to write a Pig script for the following query.
The input is:
AAA,,,
,BBB,,
,,,DDD
AAA,,,
,BBB,,
,,CCC,
,,,DDD
AAA,,,
,BBB,,
,,,DDD
The output should be:
AAA,BBB,,DDD
AAA,BBB,CCC,DDD
AAA,BBB,,DDD
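I tried the approach from "Merge two lines in Pig", but splitting the bag with BagSplit(3,$1) gives the wrong output, because my output has to merge the first three lines, then the next four lines, then the next three lines.
The input may grow, but one thing always holds: the last line of each group is ,,,DDD.
Can anyone help me?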
Answer 0 (score: 0)
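Your input data has to be split into groups of different lengths (3, 4, 3), so the BagSplit function will not work in this case. Can you try the approach below? The repeated TOTUPLE parts of relation E could be factored out using macros, but that would make the script harder to follow, so I have not optimized it for now.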
<strong>input.txt</strong>
AAA,,,
,BBB,,
,,,DDD
AAA,,,
,BBB,,
,,CCC,
,,,DDD
AAA,,,
,BBB,,
,,,DDD
<strong>PigScript:</strong>
-- Load the four comma-separated columns; empty fields are read as null.
A = LOAD 'input.txt' USING PigStorage(',') AS (f1,f2,f3,f4);
-- RANK prepends a sequential row number named rank_A.
B = RANK A;
C = GROUP B ALL;
D = FOREACH C {
    firstRecord  = FILTER B BY rank_A <= 3;                /* first 3 rows */
    secondRecord = FILTER B BY rank_A > 3 AND rank_A <= 7; /* next 4 rows */
    thirdRecord  = FILTER B BY rank_A > 7;                 /* last 3 rows */
    GENERATE firstRecord, secondRecord, thirdRecord;
}
/* Convert each column of the split bags (firstRecord, secondRecord, thirdRecord) into a string,
   then strip the text 'null' and the '_' separator that BagToString inserts by default. */
E = FOREACH D GENERATE FLATTEN(TOBAG(
        TOTUPLE(REPLACE(BagToString(firstRecord.f1),'null|_',''),
                REPLACE(BagToString(firstRecord.f2),'null|_',''),
                REPLACE(BagToString(firstRecord.f3),'null|_',''),
                REPLACE(BagToString(firstRecord.f4),'null|_','')),
        TOTUPLE(REPLACE(BagToString(secondRecord.f1),'null|_',''),
                REPLACE(BagToString(secondRecord.f2),'null|_',''),
                REPLACE(BagToString(secondRecord.f3),'null|_',''),
                REPLACE(BagToString(secondRecord.f4),'null|_','')),
        TOTUPLE(REPLACE(BagToString(thirdRecord.f1),'null|_',''),
                REPLACE(BagToString(thirdRecord.f2),'null|_',''),
                REPLACE(BagToString(thirdRecord.f3),'null|_',''),
                REPLACE(BagToString(thirdRecord.f4),'null|_',''))
    )
);
DUMP E;
<strong>Output:</strong>
(AAA,BBB,,DDD)
(AAA,BBB,CCC,DDD)
(AAA,BBB,,DDD)