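I'm trying to group records together, compute the average of SCORE1 for each group, keep only the records in the lower half of each group (SCORE1 below the group average), and then compute the average of their SCORE2. Obviously I could compute the summary statistics and join them back onto the original data set, but I would prefer to use the intermediate grouped values.
Sample input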
ID,GROUPBY,SCORE1,SCORE2
1,A,58.8,67.3
2,A,85.2,76.3
3,B,49.1,90.7
4,B,78.3,99.8
Pig script
records = load 'example.csv' Using PigStorage(',') AS (ID,GROUPBY,SCORE1,SCORE2);
grouped = group records by GROUPBY;
avgscore = foreach grouped GENERATE group AS GROUPBY, AVG(records.SCORE1) AS AVGSCORE;
joined = join grouped BY group, avgscore BY GROUPBY USING 'replicated';
results = foreach joined {
scores = foreach records generate SCORE1,SCORE2;
low = FILTER scores by SCORE1 < avgscore.AVGSCORE;
GENERATE GROUPBY, AVG(low.SCORE2);
};
dump results;
Desired output
A 67.3
B 90.7
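But this gives me java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st : (A,72.0), 2nd :(B,63.7)
Answer 0: (score: 1)
In line 4 (the join) you are actually combining two different data structures. You are joining grouped (which is grouped, with one bag per key) with avgscore (which is supposed to be flat).
You should do this instead: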
joined = join records BY GROUPBY, avgscore BY GROUPBY USING 'replicated';
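Some background on the error message, as far as I understand it: inside the nested foreach, avgscore.AVGSCORE refers to a different relation, so Pig treats it as a scalar, and a scalar has to come from a relation with exactly one row. avgscore has one row per group (A and B here), which is what the two rows quoted in the exception are. Below is a minimal sketch of a case where the scalar form does work. It assumes example.csv has no header row, the field types are my assumption, and the aliases all_grouped, overall and below are made up for illustration. Note that it compares against one overall average, not the per-group average from the question.

```pig
-- Field types are an assumption; the original script loads untyped fields.
records = LOAD 'example.csv' USING PigStorage(',')
          AS (ID:int, GROUPBY:chararray, SCORE1:double, SCORE2:double);

-- GROUP ... ALL produces a single group, so overall has exactly one row.
all_grouped = GROUP records ALL;
overall = FOREACH all_grouped GENERATE AVG(records.SCORE1) AS AVGSCORE;

-- overall is a one-row relation, so overall.AVGSCORE can be used as a scalar.
below = FILTER records BY SCORE1 < overall.AVGSCORE;
DUMP below;
```

For a per-group threshold the scalar form does not apply, which is why the join against records is needed.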
Edit: I would rewrite it like this to avoid confusion (since there will be two GROUPBY fields after the join)
records = load 'example.csv' Using PigStorage(',') AS (ID,GROUPBY,SCORE1,SCORE2);
grouped = group records by GROUPBY;
avgscore = foreach grouped GENERATE group AS GROUPBY, AVG(records.SCORE1) AS AVGSCORE;
joined = join records BY GROUPBY, avgscore BY GROUPBY USING 'replicated';
joined_reduced = foreach joined generate ID, records::GROUPBY as GROUPBY, AVGSCORE, SCORE1, SCORE2;
filter_joined = filter joined_reduced by (SCORE1 < AVGSCORE);
grouped2 = group filter_joined by GROUPBY;
result = foreach grouped2 generate flatten (group), AVG(filter_joined.SCORE2) as low_avg;
dump result;
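If the goal is to reuse the grouped values rather than join the averages back, one option might be to flatten each group's bag next to its own average and filter on that. This is only a sketch under stated assumptions: example.csv has no header row, the field types are my guess, and the aliases with_avg, low and low_grouped are made up for illustration.

```pig
-- Field types are an assumption; the original script loads untyped fields.
records = LOAD 'example.csv' USING PigStorage(',')
          AS (ID:int, GROUPBY:chararray, SCORE1:double, SCORE2:double);
grouped = GROUP records BY GROUPBY;

-- Emit one row per original record, carrying its group's average SCORE1.
with_avg = FOREACH grouped GENERATE
               group AS GROUPBY,
               AVG(records.SCORE1) AS AVGSCORE,
               FLATTEN(records.(SCORE1, SCORE2)) AS (SCORE1, SCORE2);

-- Keep the rows whose SCORE1 is below their own group's average.
low = FILTER with_avg BY SCORE1 < AVGSCORE;

-- Average SCORE2 over the remaining rows in each group.
low_grouped = GROUP low BY GROUPBY;
result = FOREACH low_grouped GENERATE group AS GROUPBY, AVG(low.SCORE2) AS low_avg;
DUMP result;
```

This still groups twice, but it avoids the separate join because the average and the rows come out of the same grouped relation.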