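I'm new to Apache Pig and can't figure out what's wrong with my use of piggybank's Over function for a cumulative calculation. I want the cumulative salary per period within the same business and location, given the following data: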
business|location|period|salary
--------+--------+------+-------
100 | East | 1 | 100
100 | East | 1 | 55
100 | East | 2 | 100
100 | East | 3 | 150
100 | West | 1 | 150
100 | West | 2 | 200
100 | West | 3 | 250
200 | East | 1 | 50
200 | East | 2 | 50
200 | East | 3 | 50
200 | West | 1 | 80
200 | West | 2 | 100
200 | West | 3 | 120
The result I'm looking for is:
business|location|period|cumulative salary
--------+--------+------+---------------
100 | East | 1 | 155
100 | East | 2 | 255
100 | East | 3 | 405
100 | West | 1 | 150
100 | West | 2 | 350
100 | West | 3 | 600
200 | East | 1 | 50
200 | East | 2 | 100
200 | East | 3 | 150
200 | West | 1 | 80
200 | West | 2 | 180
200 | West | 3 | 300
According to the Over doc, I should be able to do this with:
REGISTER /opt/mapr/pig/pig-0.12/contrib/piggybank/java/piggybank.jar;
A = LOAD '/user/sliang/pig/testData' USING PigStorage(',') as (business:long, location:chararray, period:long, salary:long);
B = group A by (business, location);
C = foreach B {
C1 = order A by period;
generate flatten(Stitch(C1, Over(C1.salary, 'sum(long)')));
};
D = foreach C generate business, location, period, $9;
But I get an error at C:
[main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1070: Could not resolve Stitch using imports: [, java.lang., org.apache.pig.builtin., org.apache.pig.impl.builtin.]
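I googled it, but there isn't much information about this. I also tried other piggybank functions from the same jar and they work, so I don't think the problem is that piggybank wasn't registered correctly. I'm using Pig 0.12.
Any help is much appreciated. Thanks!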
Answer 0 (score: 4)
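Use the full package path for both Stitch and Over. The import list in the error message contains only java.lang., org.apache.pig.builtin. and org.apache.pig.impl.builtin., so Pig does not look in the piggybank package when resolving an unqualified name.
That is, replace Stitch with org.apache.pig.piggybank.evaluation.Stitch and Over with org.apache.pig.piggybank.evaluation.Over.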
If you want to avoid the long package names above in your Pig script, define your own aliases with DEFINE and use those in the script instead:
DEFINE MYOVER org.apache.pig.piggybank.evaluation.Over;
DEFINE MYSTITCH org.apache.pig.piggybank.evaluation.Stitch;
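As a rough sketch of how those aliases could be used, the following rewrites the first part of the question's script with MYSTITCH and MYOVER in place of the fully qualified names. It assumes the same jar path and input file as in the question; the later steps of the script would stay unchanged.

```pig
-- Sketch only: jar path and input path are copied from the question.
REGISTER /opt/mapr/pig/pig-0.12/contrib/piggybank/java/piggybank.jar;

-- Short aliases for the piggybank UDFs (names taken from the answer).
DEFINE MYOVER org.apache.pig.piggybank.evaluation.Over;
DEFINE MYSTITCH org.apache.pig.piggybank.evaluation.Stitch;

A = LOAD '/user/sliang/pig/testData' USING PigStorage(',') AS (business:long, location:chararray, period:long, salary:long);
B = GROUP A BY (business, location);
C = FOREACH B {
    -- Order each group's rows by period before applying Over.
    C1 = ORDER A BY period;
    GENERATE FLATTEN(MYSTITCH(C1, MYOVER(C1.salary, 'sum(long)')));
};
```

The DEFINE statements have to appear before the first statement that uses the aliases.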
Updated Pig script:
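-- Register piggybank (same jar path as in the question).
REGISTER /opt/mapr/pig/pig-0.12/contrib/piggybank/java/piggybank.jar;
A = LOAD '/user/sliang/pig/testData' USING PigStorage(',') as (business:long, location:chararray, period:long, salary:long);
B = group A by (business, location);
-- Within each (business, location) group, order by period and stitch the Over result onto each row.
C = foreach B {
    C1 = order A by period;
    generate flatten(org.apache.pig.piggybank.evaluation.Stitch(C1, org.apache.pig.piggybank.evaluation.Over(C1.salary, 'sum(long)')));
};
-- $4 is the column produced by Over, appended after the four input fields.
D = foreach C generate business, location, period, $4;
-- A period can have more than one row (period 1 of 100/East has two),
-- so rank the rows and keep only the highest-ranked one per (business, location, period).
E = RANK D;
F = GROUP E BY (stitched::business,stitched::location,stitched::period);
G = FOREACH F {
    sortRankByDesc = ORDER E BY rank_D DESC;
    topRank = LIMIT sortRankByDesc 1;
    GENERATE FLATTEN(topRank);
}
-- Drop the rank column ($0) and name the remaining fields.
H = FOREACH G GENERATE $1 AS business,$2 AS location,$3 AS period,$4 AS salary;
DUMP H;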
Output:
(100,East,1,155)
(100,East,2,255)
(100,East,3,405)
(100,West,1,150)
(100,West,2,350)
(100,West,3,600)
(200,East,1,50)
(200,East,2,100)
(200,East,3,150)
(200,West,1,80)
(200,West,2,180)
(200,West,3,300)