Pig performing sorted intersection of 2 bags

Time: 2014-03-26 09:25:47

Tags: hadoop apache-pig

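I have loaded 2 sorted bags from HDFS. Now I want to perform a merge join or a set intersection that returns (3,Orphans of the Storm),(7,Muriel's Wedding) as the result.

I am having trouble getting either DataFu or Pig's merge join feature to work for this.

I tried the naive solution shown below, but it does not take advantage of my data being sorted.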

vegas = LOAD 'vegas' USING PigStorage() AS (B1:bag{T1:tuple(id:int, name:chararray)});
macau = LOAD 'macau' USING PigStorage() AS (B2:bag{T2:tuple(id:int, name:chararray)});
vegast = FOREACH vegas GENERATE FLATTEN(B1) AS (id:int,name:chararray);
macaut = FOREACH macau GENERATE FLATTEN(B2) AS (id:int,name:chararray);

F = join vegast by id, macaut by id;
-- o/p: (3,Orphans of the Storm), (7,Muriel's Wedding)
-- describe vegas
--vegas: {B1: {T1: (id: int,name: chararray)}}
-- data for vegas
--({(3,Orphans of the Storm),(6,One Magic Christmas),(7,Muriel's Wedding),(8,Mother's Boys),(9,Nosferatu: Original Version)})

-- describe macau
--macau: {B2: {T2: (id: int,name: chararray)}}
--data for macau
--({(1,The Nightmare Before Christmas),(3,Orphans of the Storm),(4,The Object of Beauty),(7,Muriel's Wedding)})

Can someone suggest the best way to find the intersection of 2 sorted bags using Pig?

3 answers:

Answer 0: (score: 0)

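At Xplenty (a Hadoop platform as a service) we ran into the same problem with set operations on bags, and we decided to take the simple path and implement the set operations in a JRuby UDF.

To run it, you need JRuby installed on the nodes.

See the code here: https://gist.github.com/saggineumann/9804083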

Answer 1: (score: 0)

If the relations are sorted by the join field, you can merge join them.

F = join vegast by id, macaut by id USING 'merge';

See the Pig documentation for more details: http://pig.apache.org/docs/r0.13.0/perf.html#merge-joins
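To illustrate why sorted inputs help here, a merge-style join can walk both relations once with two cursors instead of comparing every pair of rows. The Python sketch below is only an illustration of that idea and is not how Pig implements `USING 'merge'`; the function name `merge_join` and its `key` parameter are hypothetical, and the sample rows are the flattened bags from the question.

```python
def merge_join(left, right, key=lambda t: t[0]):
    """Join two lists of tuples that are BOTH sorted ascending by key.

    Illustrative sketch only (hypothetical helper, not Pig's implementation).
    """
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = key(left[i]), key(right[j])
        if lk < rk:
            i += 1          # left key is too small, advance the left cursor
        elif lk > rk:
            j += 1          # right key is too small, advance the right cursor
        else:
            # Find the run of right rows that share this key.
            j_end = j
            while j_end < len(right) and key(right[j_end]) == lk:
                j_end += 1
            # Pair every left row with this key against that run.
            while i < len(left) and key(left[i]) == lk:
                for r in right[j:j_end]:
                    out.append(left[i] + r)
                i += 1
            j = j_end
    return out


# Sample rows from the question, already sorted by id.
vegast = [(3, "Orphans of the Storm"), (6, "One Magic Christmas"),
          (7, "Muriel's Wedding"), (8, "Mother's Boys"),
          (9, "Nosferatu: Original Version")]
macaut = [(1, "The Nightmare Before Christmas"), (3, "Orphans of the Storm"),
          (4, "The Object of Beauty"), (7, "Muriel's Wedding")]

for row in merge_join(vegast, macaut):
    print(row)
```

Each joined row carries the fields of both sides, so the rows for ids 3 and 7 come out with four fields each. If either input is not sorted by the key, this single pass can miss matches, which is why the sort precondition matters.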

Answer 2: (score: 0)

  

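If anyone has SetIntersection in datafu or Pig merge join working, please provide hints

See the DataFu SetIntersect guide. It applies if you load the following structure directly, or arrive at it after performing a join (note that the bags must be sorted):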

DESCRIBE relationWith2Bags;
--relationWith2Bags: {B1: {(id: int,name: chararray)},B2: {(id: int,name: chararray)}}
--let it contain only 1 tuple with sorted bags from the question
--B1: {(3,Orphans of the Storm),(6,One Magic Christmas),(7,Muriel's Wedding),(8,Mother's Boys),(9,Nosferatu: Original Version)}
--B2: {(1,The Nightmare Before Christmas),(3,Orphans of the Storm),(4,The Object of Beauty),(7,Muriel's Wedding)}

intersect = FOREACH relationWith2Bags GENERATE datafu.pig.sets.SetIntersect(B1, B2);
DUMP intersect;
--({(3,Orphans of the Storm),(7,Muriel's Wedding)})
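
The requirement that the bags be sorted makes sense if the intersection is computed in a single forward pass over each bag. The Python sketch below shows one way such a pass could work for any number of sorted bags. It is an assumption-laden illustration, not DataFu's actual code: `intersect_sorted` is a hypothetical name, tuples are compared as whole values, and each bag is assumed to be sorted ascending with no duplicate tuples.

```python
def intersect_sorted(*bags):
    """Intersect any number of bags in one pass.

    Hypothetical helper; assumes every bag is sorted ascending and
    contains no duplicate tuples.
    """
    if not bags:
        return []
    iters = [iter(b) for b in bags]
    heads = []
    for it in iters:
        head = next(it, None)
        if head is None:
            return []           # an empty bag means an empty intersection
        heads.append(head)

    result = []
    while True:
        largest = max(heads)
        if all(h == largest for h in heads):
            # Every bag currently points at the same tuple: it is shared.
            result.append(largest)
            for i, it in enumerate(iters):
                nxt = next(it, None)
                if nxt is None:
                    return result
                heads[i] = nxt
        else:
            # Skip ahead in every bag that is still behind the largest head.
            for i, it in enumerate(iters):
                while heads[i] < largest:
                    nxt = next(it, None)
                    if nxt is None:
                        return result
                    heads[i] = nxt


# The two sorted bags from the question.
b1 = [(3, "Orphans of the Storm"), (6, "One Magic Christmas"),
      (7, "Muriel's Wedding"), (8, "Mother's Boys"),
      (9, "Nosferatu: Original Version")]
b2 = [(1, "The Nightmare Before Christmas"), (3, "Orphans of the Storm"),
      (4, "The Object of Beauty"), (7, "Muriel's Wedding")]

for row in intersect_sorted(b1, b2):
    print(row)
```

Because each bag is only ever read forward, the pass stops as soon as any bag is exhausted. With unsorted bags the same loop could skip past a tuple that appears later in another bag.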