I have two files. One is named a-records:
123^record1
222^record2
333^record3
and the other is named b-records:
123^jim
123^jim
222^mike
333^joe
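As you can see, token 123 appears once in file A but twice in file B. Is there a way, using Apache Pig, to join the data so that I get only one joined record for each record in file A?
Here is my current script, followed by what it outputs: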
arecords = LOAD '$a' USING PigStorage('^') as (token:chararray, type:chararray);
brecords = LOAD '$b' USING PigStorage('^') as (token:chararray, name:chararray);
x = JOIN arecords BY token, brecords BY token;
dump x;
This produces:
(123,record1,123,jim)
(123,record1,123,jim)
(222,record2,222,mike)
(333,record3,333,joe)
when what I really want is this (notice that token 123 appears only once after the join):
(123,record1,123,jim)
(222,record2,222,mike)
(333,record3,333,joe)
Any ideas? Thanks very much.
Answer 0 (score: 4)
I would do something like this:
arecords = LOAD '$a' USING PigStorage('^') as (token:chararray, type:chararray);
brecords = LOAD '$b' USING PigStorage('^') as (token:chararray, name:chararray);
-- Drop exact duplicate (token, name) rows from B before joining
bdistinct = DISTINCT brecords;
x = JOIN arecords BY token, bdistinct BY token;
dump x;
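One caveat: DISTINCT only removes rows that are identical in every field, so it works here because both 123 rows in b-records are exactly 123^jim. If the same token could show up with different names, a possible alternative is to group B by token and keep a single row per group. This is only a sketch, assuming your Pig version allows LIMIT inside a nested FOREACH and that you don't care which of the matching rows is kept (without an ORDER, the choice is arbitrary). The aliases bgrouped, bfirst, and one are names made up for illustration.

```pig
arecords = LOAD '$a' USING PigStorage('^') AS (token:chararray, type:chararray);
brecords = LOAD '$b' USING PigStorage('^') AS (token:chararray, name:chararray);

-- Collect all B rows that share a token into one bag per token
bgrouped = GROUP brecords BY token;

-- Keep a single (arbitrary) row from each bag; hypothetical aliases
bfirst = FOREACH bgrouped {
    one = LIMIT brecords 1;
    GENERATE FLATTEN(one) AS (token:chararray, name:chararray);
};

-- Each token now occurs at most once on the B side of the join
x = JOIN arecords BY token, bfirst BY token;
DUMP x;
```

For the sample files above this should give the same three rows as the DISTINCT version, possibly in a different order.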