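Reusing Pig groups in a nested FOREACH statement

Time: 2014-09-10 14:45:07

Tags: apache-pig

I am trying to group records together, compute the average of SCORE1 per group, narrow each group down to the bottom half of the scores (records whose SCORE1 is below the group average), and compute the average SCORE2 of those records. Obviously I could compute the summary statistics and join them back onto the original data set, but I would prefer to reuse the intermediate grouped values.

Example input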

ID,GROUPBY,SCORE1,SCORE2
1,A,58.8,67.3
2,A,85.2,76.3
3,B,49.1,90.7
4,B,78.3,99.8

Pig script

records = load 'example.csv' Using PigStorage(',') AS (ID,GROUPBY,SCORE1,SCORE2);
grouped = group records by GROUPBY;
avgscore = foreach grouped GENERATE group AS GROUPBY, AVG(records.SCORE1) AS AVGSCORE;
joined = join grouped BY group, avgscore BY GROUPBY USING 'replicated';
results = foreach joined {
    scores = foreach records generate SCORE1,SCORE2;
    low = FILTER scores by SCORE1 < avgscore.AVGSCORE;
    GENERATE GROUPBY, AVG(low.SCORE2);
};
dump results;

Desired output

A    67.3
B    90.7

But this gives me java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st : (A,72.0), 2nd :(B,63.7)

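1 Answer:

Answer 0 (score: 1)

In line 4 of your script (the join) you are actually joining two different data structures: grouped, which is already grouped (one row per key, holding a bag of records), and avgscore, which is flat. In addition, inside the nested FOREACH the expression avgscore.AVGSCORE refers to the whole avgscore relation as a scalar, which only works when that relation has exactly one row; that is why the error lists both (A,72.0) and (B,63.7).

You should join the flat records instead: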

joined = join records BY GROUPBY, avgscore BY GROUPBY USING 'replicated';

Edit: I would rewrite it like this to avoid confusion (because the join produces two GROUPBY fields):

records = load 'example.csv' Using PigStorage(',') AS (ID,GROUPBY,SCORE1,SCORE2);
grouped = group records by GROUPBY;
avgscore = foreach grouped GENERATE group AS GROUPBY, AVG(records.SCORE1) AS AVGSCORE;
joined = join records BY GROUPBY, avgscore BY GROUPBY USING 'replicated';
joined_reduced = foreach joined generate ID, records::GROUPBY as GROUPBY, AVGSCORE, SCORE1, SCORE2;
filter_joined = filter joined_reduced by (SCORE1 < AVGSCORE); -- keep the rows below the group average
grouped2 = group filter_joined by GROUPBY;
result = foreach grouped2 generate flatten (group), AVG(filter_joined.SCORE2) as low_avg;

dump result;
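
As a sanity check on the numbers, the same group / average / filter / average steps can be replayed on the four sample rows outside Pig. The following is a minimal plain-Python sketch, not Pig code; the function name low_half_score2_avg and the dict it returns are illustrative choices of mine, and it assumes that "bottom half" means SCORE1 strictly below the group average, as in the question's own filter.

```python
from collections import defaultdict

# Sample rows from the question: (ID, GROUPBY, SCORE1, SCORE2)
records = [
    (1, "A", 58.8, 67.3),
    (2, "A", 85.2, 76.3),
    (3, "B", 49.1, 90.7),
    (4, "B", 78.3, 99.8),
]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def low_half_score2_avg(rows):
    """Hypothetical helper: per group, average SCORE2 over the rows
    whose SCORE1 is strictly below that group's SCORE1 average."""
    # Step 1: group the rows by the GROUPBY column.
    groups = defaultdict(list)
    for row in rows:
        groups[row[1]].append(row)

    result = {}
    for key, members in groups.items():
        # Step 2: average SCORE1 within the group.
        avg_score1 = mean([r[2] for r in members])
        # Step 3: keep SCORE2 of the rows below that average.
        low_score2 = [r[3] for r in members if r[2] < avg_score1]
        # Step 4: average SCORE2 of the kept rows (skip empty groups).
        if low_score2:
            result[key] = mean(low_score2)
    return result


result = low_half_score2_avg(records)
for key in sorted(result):
    print(key, round(result[key], 1))
```

On the sample data the group averages are 72.0 for A and 63.7 for B, matching the two rows quoted in the error message, and the sketch prints A 67.3 and B 90.7, which is the desired output from the question.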