I have a dataset of users and elements, and I want to find every pair of users that have at least one element in common. My data is structured as follows:
id element
--------------
1 a
1 b
1 b
2 b
3 a
4 c
In this case, I would generate the following tuples:
(1,2) // both have element "b" in common
(1,3) // both have element "a" in common
I have written the following Pig script, which works at small scale, but when I ran it on just one million rows (~500MB) I killed the job after 1.5 hours, because by then it had produced nearly 40GB of data, which seems wildly disproportionate to what I'm trying to accomplish. I'm new to Pig, so I'm hoping this can be optimized a bit. Any help would be greatly appreciated.
-- load the data
mydata = LOAD '/path/to/my/data' USING PigStorage('\t') AS (user:int, element:chararray);
-- generate a copy to do a self join with
A = FOREACH mydata GENERATE user AS user_2, element AS element_2;
-- join them based on common tags
B = JOIN mydata BY element, A BY element_2;
-- we only want the mapping in one direction, e.g. (1,2) is the same as (2,1)
C = FILTER B BY user < user_2;
-- we're only interested in the user ids
D = FOREACH C GENERATE user, user_2;
-- remove any duplicate tuples
E = DISTINCT D;
STORE E INTO '/path/to/output';
Note: this is a follow-up to my previous question, hadoop pig joining on any matching tuple values, with a slightly different approach.
Answer 0 (score: 0)
If your input contains duplicates, it's best to filter them out first, since they cause a combinatorial explosion.
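A minimal sketch of that pre-deduplication, applied to the asker's original script (relation names are the asker's; the added DISTINCT is the only change — note the sample data has two `(1, b)` rows, each of which would otherwise multiply the join output):

```pig
-- drop duplicate (user, element) rows before the self join;
-- every duplicate row multiplies the join output for its element
mydata = LOAD '/path/to/my/data' USING PigStorage('\t') AS (user:int, element:chararray);
mydata_uniq = DISTINCT mydata;
A = FOREACH mydata_uniq GENERATE user AS user_2, element AS element_2;
B = JOIN mydata_uniq BY element, A BY element_2;
```

The rest of the script (FILTER, FOREACH, DISTINCT, STORE) stays the same; only the join input shrinks.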
Another thing you could try is grouping instead of joining. You get the result directly, rather than as a list of pairs:
mydata = LOAD '/path/to/data.tsv' USING PigStorage('\t') AS (user:int, element:chararray);
A = GROUP mydata by element;
B = foreach A generate (group, mydata.user);
illustrate B
which gives:
---------------------------------------------------
| mydata | user:int | element:chararray |
---------------------------------------------------
| | 1 | a |
| | 3 | a |
---------------------------------------------------
---------------------------------------------------------------------------------------------
| A | group:chararray | mydata:bag{:tuple(user:int,element:chararray)} |
---------------------------------------------------------------------------------------------
| | a | {(1, a), (3, a)} |
---------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------
| B | org.apache.pig.builtin.totuple_group_13:tuple(group:chararray,:bag{:tuple(user:int)}) |
---------------------------------------------------------------------------------------------------------------------
| | (a, {(1), (3)}) |
---------------------------------------------------------------------------------------------------------------------
So B already contains, for each element, all the user ids that share it.
To get a list of pairs, you would have to use something like:
C = foreach B {
    X = foreach $0 generate $0.$1;
    Y = foreach $0 generate $0.$1;
    F = CROSS X, Y;
    generate $0.group, flatten(F);
};
But it doesn't work... I get:
org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception while executing [POProject (Name: Project[bag][1] - scope-131 Operator Key: scope-131) children: null at []]: java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.pig.data.Tuple
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:338)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:378)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:298)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POCross.accumulateData(POCross.java:202)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POCross.getNextTuple(POCross.java:116)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNextDataBag(PhysicalOperator.java:385)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:590)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.PORelationToExprProject.getNextDataBag(PORelationToExprProject.java:106)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:309)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:378)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:298)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:464)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.processOnePackageOutput(PigGenericMapReduce.java:432)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:412)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:256)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:177)
at org.apache.pig.pen.LocalMapReduceSimulator.launchPig(LocalMapReduceSimulator.java:236)
at org.apache.pig.pen.ExampleGenerator.getData(ExampleGenerator.java:257)
at org.apache.pig.pen.ExampleGenerator.getData(ExampleGenerator.java:238)
at org.apache.pig.pen.LineageTrimmingVisitor.init(LineageTrimmingVisitor.java:103)
at org.apache.pig.pen.LineageTrimmingVisitor.<init>(LineageTrimmingVisitor.java:98)
at org.apache.pig.pen.ExampleGenerator.getExamples(ExampleGenerator.java:166)
at org.apache.pig.PigServer.getExamples(PigServer.java:1238)
at org.apache.pig.tools.grunt.GruntParser.processIllustrate(GruntParser.java:831)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.Illustrate(PigScriptParser.java:802)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:381)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:198)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:69)
at org.apache.pig.Main.run(Main.java:541)
at org.apache.pig.Main.main(Main.java:156)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:160)
Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.pig.data.Tuple
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNextTuple(POProject.java:476)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.processInputBag(POProject.java:592)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNextDataBag(POProject.java:247)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:309)
... 35 more
2014-03-20 01:28:57,235 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2997: Encountered IOException. ExecException
This might be a bug in Pig... I ran into quite a few surprises during this exercise.
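A possible way to sidestep the nested CROSS entirely (a sketch I have not benchmarked; relation names other than `mydata` are made up here) is to flatten the grouped user bag against itself — flattening two bags in one GENERATE makes Pig expand their cross product:

```pig
mydata = LOAD '/path/to/data.tsv' USING PigStorage('\t') AS (user:int, element:chararray);
-- dedupe first, as noted above, to avoid the combinatorial blow-up
uniq = DISTINCT mydata;
grouped = GROUP uniq BY element;
-- flattening the same bag twice yields its per-group cross product
pairs = FOREACH grouped GENERATE FLATTEN(uniq.user) AS u1, FLATTEN(uniq.user) AS u2;
-- keep one direction only, then dedupe pairs that share several elements
ordered = FILTER pairs BY u1 < u2;
result = DISTINCT ordered;
STORE result INTO '/path/to/output';
```

On the sample data this should yield (1,2) and (1,3), matching the desired output, without the nested-foreach CROSS that triggers the exception above.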