I have two files, for example:
File 1
id,sal,location,code
1000,1000,jupiter,F
1001,2000,jupiter,F
1002,3000,jupiter,F
1003,4000,jupiter,F
1004,5000,jupiter,F
File 2
id,sal,location,code
1000,2000,jupiter,F
1001,2000,jupiter,Z
1002,3000,jupiter,F
1003,4000,jupiter,F
1004,5000,jupiter,F
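When I compare file 1 with file 2, I need output like this:
1000,sal
1001,code
Basically, it should tell me which field changed relative to the previous file, along with the id. Can this be done in Pig?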
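Answer 0 (score: 0)
The comparison itself is easy to do; the challenging part is the output format you asked for, which needs some slightly involved logic.
I have handled most of the edge cases, but you should test it against your own input to make sure it works for every combination of changed fields.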
file1:
1000,1000,jupiter,F
1001,2000,jupiter,F
1002,3000,jupiter,F
1003,4000,jupiter,F
1004,5000,jupiter,F
file2:
1000,2000,jupiter,F
1001,2000,jupiter,Z
1002,3000,jupiter,F
1003,4000,jupiter,F
1004,5000,jupiter,F
PigScript:
A = LOAD 'file1' USING PigStorage(',') AS (id,sal,location,code);
B = LOAD 'file2' USING PigStorage(',') AS (id,sal,location,code);
C = JOIN A BY id,B BY id;
D = FOREACH C GENERATE A::id AS id,((A::sal == B::sal)?'':'sal') AS sal,
((A::location == B::location)?'':'location') AS location,
((A::code == B::code)?'':'code') AS code;
-- Drop the rows where no field differs between the two files
E = FILTER D BY NOT (sal=='' AND location=='' AND code=='');
-- The next two lines format the output: join the markers, trim leading/trailing commas, then collapse repeated commas
F = FOREACH E GENERATE id,REPLACE(BagToString(TOBAG(sal,location,code),','),'(^,+|,+$)','') AS finalOutput;
G = FOREACH F GENERATE id,REPLACE(finalOutput,',+',',');
DUMP G;
Output:
(1000,sal)
(1001,code)
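If one row per changed field is acceptable, a possible alternative avoids the string cleanup altogether. This is only a sketch and has not been run on a cluster: it assumes the same 'file1' and 'file2' paths and comma-separated layout as above, and the relation names (joined, sal_rows, diffs, and so on) and the changed_field alias are made up for illustration.

```pig
-- Sketch only: relation and alias names below are illustrative assumptions
A = LOAD 'file1' USING PigStorage(',') AS (id:chararray, sal:chararray, location:chararray, code:chararray);
B = LOAD 'file2' USING PigStorage(',') AS (id:chararray, sal:chararray, location:chararray, code:chararray);

-- Inner join: ids present in only one file are dropped
joined = JOIN A BY id, B BY id;

-- One filter per field keeps only the ids whose value differs between the files
sal_rows = FILTER joined BY A::sal != B::sal;
sal_diff = FOREACH sal_rows GENERATE A::id AS id, 'sal' AS changed_field;

location_rows = FILTER joined BY A::location != B::location;
location_diff = FOREACH location_rows GENERATE A::id AS id, 'location' AS changed_field;

code_rows = FILTER joined BY A::code != B::code;
code_diff = FOREACH code_rows GENERATE A::id AS id, 'code' AS changed_field;

-- Stack the per-field results into a single (id, changed_field) relation
diffs = UNION sal_diff, location_diff, code_diff;
DUMP diffs;
```

With the sample data this should give (1000,sal) and (1001,code), in no guaranteed order. An id with two changed fields would appear on two rows instead of one comma-separated row, and a field that is null in either file is not reported, because a comparison against null does not evaluate to true in Pig.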