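I am trying to generate aggregated output. The problem is that all of the data goes to a single reducer (the FILTER and COUNT steps are causing the problem). How can I optimize the following script?
Expected output: group,10,2,12,34 ......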
data = LOAD '/input/useragents' USING PigStorage('\t') AS (Col1:chararray,Col2:chararray,Col3:chararray,col4:chararray,col5:chararray);
grp1 = GROUP data BY UA PARALLEL 50;
fr1 = FOREACH grp1 {
fltrCol1 = FILTER data BY Col1 == 'Other';
fltrCol2 = FILTER data BY Col2 == 'Other';
fltrCol3 = FILTER data BY Col3 == 'Other';
fltrCol4 = FILTER data BY col4 == 'Other';
fltrCol5 = FILTER data BY col5 == 'Other';
cnt_fltrCol1 = COUNT(fltrCol1);
cnt_fltrCol2 = COUNT(fltrCol2);
cnt_fltrCol3 = COUNT(fltrCol3);
cnt_fltrCol4 = COUNT(fltrCol4);
cnt_fltrCol5 = COUNT(fltrCol5);
GENERATE group,cnt_fltrCol1,cnt_fltrCol2,cnt_fltrCol3,cnt_fltrCol4,cnt_fltrCol5;
}
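Answer 0 (score: 1)
You can move the filter logic ahead of the GROUP by adding the fltrCol{1,2,3,4,5} columns as integer flags (1 when the column equals 'Other', 0 otherwise) and then summing them per group. Here is the script from start to finish: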
data = LOAD '/input/useragents' USING PigStorage('\t') AS (Col1:chararray,Col2:chararray,Col3:chararray,col4:chararray,col5:chararray);
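-- Assumes UA is a field of data; it is not declared in the LOAD schema as posted.
-- The relation is named flagged because filter is a reserved word in Pig.
flagged = FOREACH data GENERATE UA,
    ((Col1 == 'Other') ? 1 : 0) AS fltrCol1,
    ((Col2 == 'Other') ? 1 : 0) AS fltrCol2,
    ((Col3 == 'Other') ? 1 : 0) AS fltrCol3,
    ((col4 == 'Other') ? 1 : 0) AS fltrCol4,
    ((col5 == 'Other') ? 1 : 0) AS fltrCol5;
-- Group the flagged relation (not data) so the flag columns are available.
grp1 = GROUP flagged BY UA PARALLEL 50;
fr1 = FOREACH grp1 {
    cnt_fltrCol1 = SUM(flagged.fltrCol1);
    cnt_fltrCol2 = SUM(flagged.fltrCol2);
    cnt_fltrCol3 = SUM(flagged.fltrCol3);
    cnt_fltrCol4 = SUM(flagged.fltrCol4);
    cnt_fltrCol5 = SUM(flagged.fltrCol5);
    GENERATE group,cnt_fltrCol1,cnt_fltrCol2,cnt_fltrCol3,cnt_fltrCol4,cnt_fltrCol5;
}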
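A possible variation on the same idea, as a sketch rather than a tested script: because SUM is an algebraic function, a flat (non-nested) FOREACH over the grouped relation is the form for which Pig's documentation describes the combiner being applied, so partial sums may be computed on the map side before the shuffle. The relation names flagged and counts are illustrative, and the UA field in the LOAD schema is an assumption, since the posted schema does not declare it.

```pig
-- Assumption: UA is the first tab-separated field of the input.
data = LOAD '/input/useragents' USING PigStorage('\t')
       AS (UA:chararray, Col1:chararray, Col2:chararray, Col3:chararray,
           col4:chararray, col5:chararray);

-- Turn each 'Other' test into a 0/1 flag before grouping.
flagged = FOREACH data GENERATE UA,
    ((Col1 == 'Other') ? 1 : 0) AS fltrCol1,
    ((Col2 == 'Other') ? 1 : 0) AS fltrCol2,
    ((Col3 == 'Other') ? 1 : 0) AS fltrCol3,
    ((col4 == 'Other') ? 1 : 0) AS fltrCol4,
    ((col5 == 'Other') ? 1 : 0) AS fltrCol5;

grp1 = GROUP flagged BY UA PARALLEL 50;

-- Flat FOREACH: every projection is the group key or an algebraic SUM.
counts = FOREACH grp1 GENERATE group,
    SUM(flagged.fltrCol1) AS cnt_fltrCol1,
    SUM(flagged.fltrCol2) AS cnt_fltrCol2,
    SUM(flagged.fltrCol3) AS cnt_fltrCol3,
    SUM(flagged.fltrCol4) AS cnt_fltrCol4,
    SUM(flagged.fltrCol5) AS cnt_fltrCol5;
```

Running EXPLAIN counts; should show whether a combine plan is actually generated on your Pig version.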