How to join 2 datasets with the same schema in Pig

Date: 2016-03-15 06:56:55

Tags: hadoop join mapreduce tuples apache-pig

Hello, I'm relatively new to Pig programming and have run into a problem I'm struggling to solve:

I have 2 datasets

A: (accountId:chararray, title:chararray, genre:chararray)

("A123", "Harry Potter", "Action/Adventure")
("A123", "Sherlock Holmes", "Mystery")
("B456", "James Bond", "Action")
("B456", "Hamlet", "Drama")

B: (accountId:chararray, title:chararray, genre:chararray)

("B456", "Percy Jackson", "Action/Adventure")
("B456", "Elementary", "Mystery")
("A123", "Divergent", "Action")
("A123", "Downton Abbey", "Drama")

The result I want should be

(accountId:chararray, {(), (), ...})

(A123, {("A123", "Harry Potter", "Action/Adventure"),
        ("A123", "Sherlock Holmes", "Mystery"),
        ("A123", "Divergent", "Action"),
        ("A123", "Downton Abbey", "Drama")
        })

(B456, {("B456", "James Bond", "Action"),
        ("B456", "Hamlet", "Drama"),
        ("B456", "Percy Jackson", "Action/Adventure"),
        ("B456", "Elementary", "Mystery")
        })

Currently I'm doing:

ANS = JOIN A BY accountId, B BY accountId;

But the result looks like

SCHEMA: (accountId:chararray, {(accountId:chararray, title:chararray, genre:chararray), ...})

(B456, {("B456", "James Bond", "Action"),
        ("B456", "Hamlet", "Drama")}
       "B456", {
        ("B456", "Percy Jackson", "Action/Adventure"),
        ("B456", "Elementary", "Mystery")
        })

Any idea what I might be doing wrong?

1 answer:

Answer 0: (score: 1)

Try this:

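-- IMPORTANT: register datafu.jar first (adjust the path to your installation)
register datafu.jar;
define BagConcat datafu.pig.bags.BagConcat();
A = load 'A' using PigStorage(',') as (id:chararray, title:chararray, genre:chararray);
B = load 'B' using PigStorage(',') as (id:chararray, title:chararray, genre:chararray);
-- cogroup yields one row per id: (group, bag of A rows, bag of B rows)
C = cogroup A by id, B by id;
-- keep the id and merge the two bags into a single bag
D = foreach C generate group as id, BagConcat(A, B);
dump D;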

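JOIN simply pairs up rows from the two relations. You want to accomplish two things:

  • Group all rows that belong to the same account within each relation
  • Join the two "grouped" relations (keeping only the IDs that exist in both relations)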

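COGROUP performs both of these operations in one step. Note that by default COGROUP also keeps IDs that appear in only one relation, with an empty bag for the other; add INNER to each input if you want only the IDs present in both. The best explanation I have read is: http://joshualande.com/cogroup-in-pig/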
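To see what COGROUP hands back before any merging, a minimal sketch along these lines may help. It assumes the same comma-separated input files 'A' and 'B' as in the script above; the alias C_inner is only an illustrative name.

```pig
A = LOAD 'A' USING PigStorage(',') AS (id:chararray, title:chararray, genre:chararray);
B = LOAD 'B' USING PigStorage(',') AS (id:chararray, title:chararray, genre:chararray);

-- One row per id with three fields: the key (named "group"),
-- a bag named A holding that id's rows from A, and a bag named B likewise.
C = COGROUP A BY id, B BY id;
DESCRIBE C;

-- Assumption: only ids present in BOTH relations are wanted.
-- INNER drops rows whose bag on that side would be empty.
C_inner = COGROUP A BY id INNER, B BY id INNER;
DUMP C_inner;
```

With the sample data in the question every id occurs in both relations, so C and C_inner should contain the same rows.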

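Your relation will now contain the group key (the ID) and two bags (one from A, one from B), each holding the rows from the original relation. To "merge" them into a single bag, use the BagConcat function from datafu.jar. DataFu is a library of Pig UDFs with many useful functions. You can read about it here: http://datafu.incubator.apache.org/docs/datafu/guide/bag-operations.html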
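If adding DataFu is not an option, one alternative (a different technique from the BagConcat approach above, not something the linked guide prescribes) is to stack the two relations with UNION and then GROUP once. This is only a sketch and relies on A and B having identical schemas, as they do in the question; the aliases U and G are illustrative.

```pig
A = LOAD 'A' USING PigStorage(',') AS (id:chararray, title:chararray, genre:chararray);
B = LOAD 'B' USING PigStorage(',') AS (id:chararray, title:chararray, genre:chararray);

-- Stack the rows of both relations into one relation with the shared schema.
U = UNION A, B;

-- One row per id: the key (named "group") and a single bag named U
-- containing every row for that id from either input.
G = GROUP U BY id;
DUMP G;
```

Unlike COGROUP with INNER, this keeps ids that occur in only one of the two inputs, so it fits best when that is acceptable.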