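Given an older CSV file and a newer CSV file, I want to find the fields that changed on rows that share the same key. For example, suppose the unique key is in field $2 (the third column, since Pig positions are zero-based) and we have two files: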
Old csv file:
FIELD1,FIELD2,ID,FIELD4
a,a,key1,a
b,b,key2,b

New csv file:
FIELD1,FIELD2,ID,FIELD4
a,a2,key1,a2
b,b,key2,b
The desired output is something like:
{FIELD2:a2,ID:key1,FIELD4:a2}
In other words: on the row with ID = key1, the 2nd and 4th fields changed, and these are their new values.
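To pin down the semantics I am after, here is a minimal sketch in plain Python rather than Pig. The function name `changed_fields` and the inline sample strings are only for illustration; the sketch assumes both files share the same header row, that the key column is named ID, and that keys present in only one file are ignored.

```python
import csv
import io


def changed_fields(old_text, new_text, key="ID"):
    """For each key present in both inputs, collect the key plus any
    fields whose value differs, using the values from the new input."""
    # Index the old rows by key so each new row can be compared directly.
    old_rows = {row[key]: row for row in csv.DictReader(io.StringIO(old_text))}
    changes = []
    for new_row in csv.DictReader(io.StringIO(new_text)):
        old_row = old_rows.get(new_row[key])
        if old_row is None:
            continue  # assumption: rows that exist only in the new file are skipped
        diff = {
            name: value
            for name, value in new_row.items()
            if name == key or old_row.get(name) != value
        }
        if len(diff) > 1:  # more than just the key means something changed
            changes.append(diff)
    return changes


# Sample data copied from the two files shown above.
OLD = "FIELD1,FIELD2,ID,FIELD4\na,a,key1,a\nb,b,key2,b\n"
NEW = "FIELD1,FIELD2,ID,FIELD4\na,a2,key1,a2\nb,b,key2,b\n"

print(changed_fields(OLD, NEW))
# → [{'FIELD2': 'a2', 'ID': 'key1', 'FIELD4': 'a2'}]
```

This only works for files that fit in memory, which is why I am looking for the equivalent in Pig.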
A Pig script that outputs the whole row if any field changed is:
old = load '$old' using PigStorage('\n') as (line:chararray);
new = load '$new' using PigStorage('\n') as (line:chararray);
cg = cogroup old by line, new by line;
new_only = foreach (filter cg by IsEmpty(old)) generate flatten(new);
store new_only into '$changes';
My initial idea (which I don't know how to finish) is:
old = LOAD '$old' USING PigStorage('|');
new = LOAD '$new' USING PigStorage('|');
cogroup_data = COGROUP old by $2, new by $2; -- 3rd column is unique key
diff_data = FOREACH cogroup_data GENERATE DIFF(old,new);
-- ({(a,a,key1,a),(a,a2,key1,a2)})
-- ? what goes here ?