我没有成功让RANK
运算符在pig版本0.12.1中工作。我想在发布错误报告之前会在这里发帖。
这是我非常简单的脚本:
InputData = LOAD '$in' USING PigStorage('\u0001') AS (a1:chararray, a2:chararray, score:float);
Ranked = RANK InputData BY score DESC DENSE;
OutputData = FOREACH Ranked GENERATE
rank_InputData AS rank,
a1 AS a1,
score AS score;
STORE OutputData INTO '$out' using PigStorage('\u0001');
我使用两个不同版本的输入运行它:
这些输入包含相同的数据(280M行,25GB),但第二个输入的数据聚合为较少数量的文件(由this thread related to a counters-per-mapper issue激励)。
两个输入的结果(“Java堆空间”错误)相同:
Backend error message
---------------------
Error: Java heap space
Pig Stack Trace
---------------
ERROR 2244: Job failed, hadoop does not return any error message
org.apache.pig.backend.executionengine.ExecException: ERROR 2244: Job failed, hadoop does not return any error message
at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:148)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:202)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:173)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
at org.apache.pig.Main.run(Main.java:478)
at org.apache.pig.Main.main(Main.java:156)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
我尝试了另一个版本的脚本,它使用ORDER ... BY
来完成一些工作:
InputData = LOAD '$in' USING PigStorage('\u0001') AS (a1:chararray, a2:chararray, score:float);
InputDataOrdered = ORDER InputData BY score DESC PARALLEL 60;
Ranked = RANK InputDataOrdered;
OutputData = FOREACH Ranked GENERATE
rank_InputDataOrdered AS rank,
a1 AS a1,
score AS score;
STORE OutputData INTO '$out' using PigStorage('\u0001');
两个输入都给出了相同的错误,这次是“太多计数器”:
Pig Stack Trace
---------------
ERROR 2043: Unexpected error during execution.
...
Caused by: java.io.IOException: Error reading responses
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:843)
Caused by: org.apache.hadoop.mapreduce.counters.LimitExceededException: Too many counters: 121 max=120
at org.apache.hadoop.mapreduce.counters.Limits.checkCounters(Limits.java:61)
at org.apache.hadoop.mapreduce.counters.Limits.incrCounters(Limits.java:68)
at org.apache.hadoop.mapreduce.counters.AbstractCounterGroup.readFields(AbstractCounterGroup.java:174)
at org.apache.hadoop.mapred.Counters$Group.readFields(Counters.java:278)
at org.apache.hadoop.mapreduce.counters.AbstractCounters.readFields(AbstractCounters.java:303)
at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:280)
at org.apache.hadoop.io.ObjectWritable.readFields(ObjectWritable.java:75)
at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:952)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:836)
有什么想法吗?
答案 0 :(得分:0)
我建议您在制作排名字段时在DataFu库中使用Enumerate UDF。
register 'datafu-1.2.0.jar';
define Enumerate datafu.pig.bags.Enumerate('1');
A = load 'test.dat' using PigStorage() AS ( f1:int, f2:int);
A = group A ALL;
A = foreach A {
sorted = order A by f1 DESC, f2 ASC;
generate group, sorted;
}
B = foreach A generate FLATTEN( Enumerate( sorted ));
dump B; <- ( f1, f2, rank)
参考http://datafu.incubator.apache.org/docs/datafu/1.1.0/datafu/pig/bags/Enumerate.html