How do I compare two CSV files in Pig and output only the columns that differ?

Time: 2016-10-13 17:03:33

Tags: csv apache-pig

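Given an older CSV file and a newer CSV file, I want to find the fields that have changed on rows that share the same key. For example, if the unique key is in field $2 and we have these two files: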

Old csv file: 
     FIELD1,FIELD2,ID,FIELD4 
     a,a,key1,a 
     b,b,key2,b

New csv file:  
     FIELD1,FIELD2,ID,FIELD4
     a,a2,key1,a2 
     b,b,key2,b 

The desired output is something like:

     {FIELD2:a2,ID:key1,FIELD4:a2}

In other words, on the row with ID = key1, the 2nd and 4th fields changed, and these are the new values.

A Pig script that outputs the whole row if any field has changed is:

old = load '$old' using PigStorage('\n') as (line:chararray);
new = load '$new' using PigStorage('\n') as (line:chararray);
cg = cogroup old by line, new by line;
new_only = foreach (filter cg by IsEmpty(old)) generate flatten(new);
store new_only into '$changes';

My initial idea (which I don't know how to finish) is:

old = LOAD '$old' USING PigStorage(',');
new = LOAD '$new' USING PigStorage(',');
cogroup_data = COGROUP old BY $2, new BY $2; -- 3rd column is the unique key
diff_data = FOREACH cogroup_data GENERATE DIFF(old, new);
-- ({(a,a,key1,a),(a,a2,key1,a2)})
-- ? what goes here ?
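
One possible way to fill in that last step, as an untested sketch rather than a confirmed solution: declare a schema, JOIN the two relations on the key instead of using COGROUP and DIFF, and compare the fields pairwise with the bincond operator. It assumes both files are comma-delimited with exactly four columns and the key in the third. The field names `f1`, `f2`, `id`, `f4` are placeholders made up for illustration, and unchanged fields are written as empty strings rather than omitted.

```pig
-- Sketch only: assumes four comma-delimited columns with the key in the third.
-- f1, f2, id, f4 are placeholder names, not taken from a real schema.
old = LOAD '$old' USING PigStorage(',') AS (f1:chararray, f2:chararray, id:chararray, f4:chararray);
new = LOAD '$new' USING PigStorage(',') AS (f1:chararray, f2:chararray, id:chararray, f4:chararray);

-- Inner join: pairs each old row with the new row that has the same key.
joined = JOIN old BY id, new BY id;

-- Keep only the keys where at least one non-key field differs.
changed = FILTER joined BY (old::f1 != new::f1) OR (old::f2 != new::f2) OR (old::f4 != new::f4);

-- Emit the key plus the new value of each changed field; unchanged fields become ''.
diff_data = FOREACH changed GENERATE
    new::id AS id,
    (old::f1 != new::f1 ? new::f1 : '') AS f1,
    (old::f2 != new::f2 ? new::f2 : '') AS f2,
    (old::f4 != new::f4 ? new::f4 : '') AS f4;

STORE diff_data INTO '$changes' USING PigStorage(',');
```

For the sample files, and assuming the values carry no stray whitespace, this should keep only the key1 row, with a2 in the f2 and f4 positions and an empty f1. The header line joins with itself but is filtered out because none of its fields differ. Two limitations of this sketch: the inner join drops keys that exist in only one file, and a comparison involving a null (empty) field evaluates to null, so a change to or from an empty value would not be detected without extra IS NULL checks.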

0 answers:

No answers