Hello, I am relatively new to Pig programming and have run into a problem I am having trouble solving.
I have two datasets:
A: (accountId:chararray, title:chararray, genre:chararray)
("A123", "Harry Potter", "Action/Adventure")
("A123", "Sherlock Holmes", "Mystery")
("B456", "James Bond", "Action")
("B456", "Hamlet", "Drama")
B: (accountId:chararray, title:chararray, genre:chararray)
("B456", "Percy Jackson", "Action/Adventure")
("B456", "Elementary", "Mystery")
("A123", "Divergent", "Action")
("A123", "Downton Abbey", "Drama")
The result I want is:
(accountId:chararray, {(), (), ...})
(A123, {("A123", "Harry Potter", "Action/Adventure"),
("A123", "Sherlock Holmes", "Mystery"),
("A123", "Divergent", "Action"),
("A123", "Downton Abbey", "Drama")
})
(B456, {("B456", "James Bond", "Action"),
("B456", "Hamlet", "Drama"),
("B456", "Percy Jackson", "Action/Adventure"),
("B456", "Elementary", "Mystery")
})
Currently I am doing:
ANS = JOIN A BY accountId, B BY accountId;
But the result looks like:
SCHEMA: (accountId:chararray, {(accountId:chararray, title:chararray, genre:chararray), ...})
(B456, {("B456", "James Bond", "Action"),
("B456", "Hamlet", "Drama")}
"B456", {
("B456", "Percy Jackson", "Action/Adventure"),
("B456", "Elementary", "Mystery")
})
Any idea what I might be doing wrong?
Answer 0 (score: 1)
Try this:
-- IMPORTANT: register datafu.jar
define BagConcat datafu.pig.bags.BagConcat();
A = load 'A' using PigStorage(',') as (id:chararray, title:chararray, genre:chararray);
B = load 'B' using PigStorage(',') as (id:chararray, title:chararray, genre:chararray);
C = cogroup A by id, B by id;
D = foreach C generate group as accountId, BagConcat(A, B);
dump D;
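JOIN simply joins rows from the two relations. You want to accomplish two things: group the rows by id, and collect the rows from both relations under each key.
Both of these operations are performed by COGROUP. The best explanation I have read is: http://joshualande.com/cogroup-in-pig/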
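Your relation will now contain the group key (id) and two bags (one from A, one from B), each holding the rows from the original relation. To "merge" them into a single bag, use the BagConcat function from datafu.jar. DataFu is a library of Pig UDFs full of useful things. You can read about it here: http://datafu.incubator.apache.org/docs/datafu/guide/bag-operations.html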
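If registering DataFu is not an option, a similar single-bag result can probably be reached with built-in operators only. The following is a minimal sketch, not part of the answer above, and it assumes A and B are loaded with identical schemas as in the script: UNION stacks the rows of both relations, and GROUP then collects them into one bag per id. The aliases U, G, R and the field name titles are illustrative.

```pig
-- Assumes the same input files and schema as the script above.
A = load 'A' using PigStorage(',') as (id:chararray, title:chararray, genre:chararray);
B = load 'B' using PigStorage(',') as (id:chararray, title:chararray, genre:chararray);

-- Stack the rows of both relations into one relation.
U = union A, B;

-- Group by id: each output row is (group, bag of all rows with that id).
G = group U by id;

-- Optional: give the fields friendlier names (hypothetical aliases).
R = foreach G generate group as accountId, U as titles;

dump R;
```

The order of tuples inside each bag is not guaranteed, so compare results as sets rather than line by line.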