Field-by-field comparison of files in Pig

Time: 2015-04-07 17:09:14

Tags: hadoop mapreduce apache-pig

I have two files, for example:

File 1

id,sal,location,code
1000,1000,jupiter,F
1001,2000,jupiter,F
1002,3000,jupiter,F
1003,4000,jupiter,F
1004,5000,jupiter,F

File 2

id,sal,location,code
1000,2000,jupiter,F
1001,2000,jupiter,Z
1002,3000,jupiter,F
1003,4000,jupiter,F
1004,5000,jupiter,F

When I compare file1 with file2, I need output like this:

1000, sal
1001,code

Basically, it should tell me which field changed compared with the previous file, along with the id. Can this be done in Pig?

1 answer:

Answer 0 (score: 0)

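Finding the differing fields is easy; the challenging part is the output format you mentioned, which needs slightly more involved logic.

I have handled most of the edge cases, but please check your own input to make sure the script works for every combination of changed fields.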

file1:

1000,1000,jupiter,F
1001,2000,jupiter,F
1002,3000,jupiter,F
1003,4000,jupiter,F
1004,5000,jupiter,F

file2:

1000,2000,jupiter,F
1001,2000,jupiter,Z
1002,3000,jupiter,F
1003,4000,jupiter,F
1004,5000,jupiter,F

PigScript:

    A = LOAD 'file1' USING PigStorage(',') AS (id,sal,location,code);
    B = LOAD 'file2' USING PigStorage(',') AS (id,sal,location,code);
    C = JOIN A BY id,B BY id;
    D = FOREACH C GENERATE A::id AS id,((A::sal == B::sal)?'':'sal') AS sal,
                                       ((A::location == B::location)?'':'location') AS location,
                                       ((A::code == B::code)?'':'code') AS code;

    -- Drop rows where none of the compared fields differ
    E = FILTER D BY NOT (sal=='' AND location=='' AND code=='');

    -- Format the output: join the changed field names with commas,
    -- strip leading/trailing commas, then collapse repeated commas
    F = FOREACH E GENERATE id,REPLACE(BagToString(TOBAG(sal,location,code),','),'(^,+|,+$)','') AS finalOutput;
    G = FOREACH F GENERATE id,REPLACE(finalOutput,',+',',');
    DUMP G;

Output:

(1000,sal)
(1001,code)
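
If one row per changed field is acceptable instead of a comma-joined list, the string formatting can be avoided altogether. The following is only a sketch under that assumption, not part of the original answer: it reuses the same 'file1' and 'file2' inputs, declares every column as chararray, and the aliases sal_diff, loc_diff, code_diff, and changes as well as the column name changed_field are made up for illustration.

```pig
-- Assumption: same comma-delimited inputs as above; all columns read as chararray.
A = LOAD 'file1' USING PigStorage(',') AS (id:chararray, sal:chararray, location:chararray, code:chararray);
B = LOAD 'file2' USING PigStorage(',') AS (id:chararray, sal:chararray, location:chararray, code:chararray);
C = JOIN A BY id, B BY id;

-- For each compared field, keep only the ids whose value differs between the files.
sal_rows  = FILTER C BY A::sal != B::sal;
sal_diff  = FOREACH sal_rows GENERATE A::id AS id, 'sal' AS changed_field;

loc_rows  = FILTER C BY A::location != B::location;
loc_diff  = FOREACH loc_rows GENERATE A::id AS id, 'location' AS changed_field;

code_rows = FILTER C BY A::code != B::code;
code_diff = FOREACH code_rows GENERATE A::id AS id, 'code' AS changed_field;

-- Combine the per-field results: one (id, changed_field) row per difference.
changes = UNION sal_diff, loc_diff, code_diff;
DUMP changes;
```

On the sample data this should give the rows (1000,sal) and (1001,code), though UNION does not guarantee their order. An id with several changed fields appears once per field, and ids present in only one file are dropped by the inner join, just as in the script above.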