Merging and overwriting datasets in Pig

Date: 2015-09-23 17:57:39

Tags: hadoop join merge apache-pig

I have 3 data sets, all in the format (acctid:chararray, rule:chararray, value:chararray).

Set 1 file:

123;R1;r1 version set 1 123
123;R2;r2 version set 1 123
123;R3;r3 version set 1 123
124;R1;r1 version set 1 124
124;R2;r2 version set 1 124
124;R3;r3 version set 1 124

Set 2 file: // changes R2

123;R2;r2 version set 2 123
124;R2;r2 version set 2 124

Set 3 file:

123;R4;r4 version set 3 123
124;R4;r4 version set 3 124

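I need to merge the data so that:

  • the R2 values from the first data set are replaced by the values from the second set

  • the R3 values are removed

  • the R4 values are added

Then I can group by account ID and get:

Final: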

123;R1;r1 version set 1 123
123;R2;r2 version set 2 123
123;R4;r4 version set 3 123
124;R1;r1 version set 1 124
124;R2;r2 version set 2 124
124;R4;r4 version set 3 124

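I have tried various joins and merges, but I can't tell whether this is possible. Thanks.

1 answer:

Answer 0: (score: 1)

Try this; it should produce the desired output.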

Script (the final dump prints the merged result):

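-- Load the three semicolon-delimited inputs, all with the same schema.
set_1 = LOAD '/home/abhis/set_1' USING PigStorage(';') AS (acctid:chararray, rule:chararray, value:chararray);
set_2 = LOAD '/home/abhis/set_2' USING PigStorage(';') AS (acctid:chararray, rule:chararray, value:chararray);
set_3 = LOAD '/home/abhis/set_3' USING PigStorage(';') AS (acctid:chararray, rule:chararray, value:chararray);

-- Keep only the R1 rows of set 1: R2 is replaced by set 2 and R3 is dropped.
-- Note that the pattern matches any rule containing "R1" (e.g. R10 as well).
DATA_SET1 = FILTER set_1 BY (rule matches '.*R1.*');

-- Combine the R1 rows with the R2 overrides (set 2) and the new R4 rows (set 3).
DATA_SET2 = UNION DATA_SET1, set_2, set_3;
DATA_SET3 = ORDER DATA_SET2 BY acctid, rule;
DUMP DATA_SET3;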
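
The script above works because it hardcodes that only R1 survives from set 1. If set 1 may contain other rules that should be kept, one possible variant is to let set 2 override matching rows through a left outer join on (acctid, rule). This is a sketch, not the original answer's method: the relation names base, joined, overridden, combined and result are made up for illustration, and it assumes set 2 has at most one row per (acctid, rule) pair.

```pig
-- Same inputs and schema as in the question.
set_1 = LOAD '/home/abhis/set_1' USING PigStorage(';') AS (acctid:chararray, rule:chararray, value:chararray);
set_2 = LOAD '/home/abhis/set_2' USING PigStorage(';') AS (acctid:chararray, rule:chararray, value:chararray);
set_3 = LOAD '/home/abhis/set_3' USING PigStorage(';') AS (acctid:chararray, rule:chararray, value:chararray);

-- Drop the rule that must be removed ('R3' comes from the question's requirements).
base = FILTER set_1 BY rule != 'R3';

-- Keep every remaining set 1 row and attach the set 2 row with the same key, if any.
joined = JOIN base BY (acctid, rule) LEFT OUTER, set_2 BY (acctid, rule);

-- Prefer the set 2 value when a match exists, otherwise keep the set 1 value.
overridden = FOREACH joined GENERATE
    base::acctid AS acctid,
    base::rule   AS rule,
    (set_2::value IS NOT NULL ? set_2::value : base::value) AS value;

-- Append the new R4 rows from set 3 and sort for readability.
combined = UNION overridden, set_3;
result = ORDER combined BY acctid, rule;
DUMP result;
```

For the sample files in the question this should yield the same six rows as the "Final" listing. Rows in set 2 whose (acctid, rule) key does not appear in set 1 are not added by this variant; if they should be, they would need to be unioned in separately.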