How do I correctly perform an inner join in Apache Pig?

Asked: 2011-10-17 06:23:54

Tags: java hadoop apache-pig

I have two files. One is named a-records:

123^record1
222^record2
333^record3

The other is named b-records:

123^jim
123^jim
222^mike
333^joe

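As you can see, token 123 appears once in file A but twice in file B. Is there a way in Apache Pig to join the data so that I get only one joined record for each record in file A?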

Here is my current script:

arecords = LOAD '$a'  USING PigStorage('^')  as (token:chararray, type:chararray);

brecords =  LOAD '$b'  USING PigStorage('^')  as (token:chararray, name:chararray);


x = JOIN arecords BY token, brecords BY token;

dump x;

It produces:

(123,record1,123,jim)
(123,record1,123,jim)
(222,record2,222,mike)
(333,record3,333,joe)

What I actually want is this (note that token 123 appears only once after the join):

(123,record1,123,jim)
(222,record2,222,mike)
(333,record3,333,joe)

Any ideas? Many thanks.

1 answer:

Answer 0 (score: 4)

I would remove the duplicate rows from brecords with DISTINCT before joining, something like this:

arecords = LOAD '$a'  USING PigStorage('^')  as (token:chararray, type:chararray);

brecords =  LOAD '$b'  USING PigStorage('^')  as (token:chararray, name:chararray);

bdistinct = DISTINCT brecords;

x = JOIN arecords BY token, bdistinct BY token;

dump x;
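
One caveat: DISTINCT only removes rows that are identical in every field. If b-records contained the same token with two different names (say 123^jim and 123^james), both rows would survive and the join would still emit two records for 123. The following is a minimal sketch for that case, not a definitive solution. It assumes Pig 0.9 or later (where LIMIT is allowed inside a nested FOREACH), and the aliases bgrouped, bsingle, and onerow are illustrative names that do not appear in the question. Which of the competing rows is kept is arbitrary.

```
arecords = LOAD '$a' USING PigStorage('^') AS (token:chararray, type:chararray);
brecords = LOAD '$b' USING PigStorage('^') AS (token:chararray, name:chararray);

-- Group the b rows by token so each token has one bag of candidate rows.
bgrouped = GROUP brecords BY token;

-- Keep a single, arbitrarily chosen row per token (assumes nested LIMIT is available).
bsingle = FOREACH bgrouped {
    onerow = LIMIT brecords 1;
    GENERATE FLATTEN(onerow) AS (token:chararray, name:chararray);
};

-- Each token now occurs at most once on the b side, so each a row joins at most once.
x = JOIN arecords BY token, bsingle BY token;

DUMP x;
```

If a particular row should win (for example the alphabetically first name), an ORDER inside the nested block before the LIMIT would make the choice deterministic.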