How to avoid doing the same join twice for two fields?

Time: 2013-01-18 06:07:34

Tags: apache-pig

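I admit the title of this question is unclear. If someone could reword it after reading my question, that would be great.

Anyway, I have a relation whose fields are a pair of word IDs, and I want to replace both IDs with their texts. Currently I am doing two JOINs and two FOREACHs, as follows: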

-- Each row holds a pair of word IDs.
WordIDs = LOAD 'wordID.txt' AS (wordID1:long, wordID2:long);
-- Lookup relation mapping a word ID to its text.
WordTexts = LOAD 'wordText.txt' AS (wordID:long, wordText:chararray);

-- First join: replace wordID1 with its text.
Join1 = JOIN WordIDs BY wordID1, WordTexts BY wordID;
Replaced1 = FOREACH Join1 GENERATE WordTexts::wordText AS wordText1, WordIDs::wordID2 AS wordID2;

-- Second join: replace wordID2 with its text.
Join2 = JOIN Replaced1 BY wordID2, WordTexts BY wordID;
Replaced2 = FOREACH Join2 GENERATE Replaced1::wordText1 AS wordText1, WordTexts::wordText AS wordText2;

Is there a way to do this with fewer statements (for example, one join instead of two)?

1 Answer:

Answer 0: (score: 1)

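I think your current code will generate two separate MapReduce jobs. To avoid that, use a replicated join. It does not reduce the number of JOIN statements, but both joins are then done map-side, so only one MapReduce job is needed. The code should look something like this (I haven't run it):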

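WordIDs = LOAD 'wordID.txt' AS (wordID1:long, wordID2:long);
WordTexts = LOAD 'wordText.txt' AS (wordID:long, wordText:chararray);

-- In a replicated join the last relation (WordTexts) is loaded into memory, so it must be small.
Join1 = JOIN WordIDs BY wordID1, WordTexts BY wordID USING 'replicated';
Join2 = JOIN Join1 BY WordIDs::wordID2, WordTexts BY wordID USING 'replicated';

-- Fields coming from Join1 keep the Join1:: prefix; the newly joined WordTexts fields do not.
Replaced = FOREACH Join2 GENERATE Join1::WordTexts::wordText AS wordText1, WordTexts::wordText AS wordText2;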
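
To show why a replicated join can skip the reduce phase, here is a rough Python sketch of the idea. It is not Pig code and not a description of Pig's internals; the function name and sample data are made up for illustration, and it assumes each word ID appears only once in the lookup data.

```python
# Conceptual sketch of a replicated (map-side) join.
# The function name and the sample data are hypothetical.

def replace_ids_with_text(word_id_pairs, word_texts):
    """Replace both IDs in each pair with their text in a single pass."""
    # The small relation is held in memory as a hash table, so every
    # input row can be resolved locally without a shuffle or reduce step.
    # Assumption: each word ID occurs at most once in word_texts.
    lookup = dict(word_texts)

    replaced = []
    for id1, id2 in word_id_pairs:
        # Inner-join semantics: drop the pair if either ID has no text.
        if id1 in lookup and id2 in lookup:
            replaced.append((lookup[id1], lookup[id2]))
    return replaced


word_texts = [(1, "apple"), (2, "banana"), (3, "cherry")]
word_id_pairs = [(1, 2), (3, 1), (2, 9)]  # ID 9 has no text, so (2, 9) is dropped

print(replace_ids_with_text(word_id_pairs, word_texts))
```

Because the lookup table sits in memory, both IDs are resolved while streaming over the large relation once. That is the same reason the two replicated joins above can run in a single job, provided WordTexts fits in memory.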