Optimizing a Pig script for aggregate output

Time: 2016-12-29 10:54:57

Tags: hadoop apache-pig

I am trying to generate aggregate output. Which is the better way to do it:

A_GROUP = GROUP A BY ID PARALLEL;
A_COUNT = FOREACH A_GROUP {
        A_TMP1 = FILTER A BY Col1 == 'Other';
        A_TMP2 = FILTER A BY Col2 == 'Other';
        cnt_fltrCol1 = COUNT(A_TMP1);
        cnt_fltrCol2 = COUNT(A_TMP2);
        GENERATE group,cnt_fltrCol1,cnt_fltrCol2;
} 

Or:

A_FOREACH = FOREACH A GENERATE *, 
        ((Col1 == 'Other') ? 1 : 0) as fltrCol1, 
        ((Col2 == 'Other') ? 1 : 0) as fltrCol2;

A_GRP = GROUP A_FOREACH BY ID;

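A_COUNT = FOREACH A_GRP {
        -- After GROUP, the flag columns live inside the bag named A_FOREACH,
        -- so they are referenced through that bag.
        cnt_fltrCol1 = SUM(A_FOREACH.fltrCol1);
        cnt_fltrCol2 = SUM(A_FOREACH.fltrCol2);
        GENERATE group, cnt_fltrCol1, cnt_fltrCol2;
}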
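
A leaner form of the second approach can be sketched as follows. It is only a sketch under stated assumptions: that A has at least the fields ID, Col1 and Col2, and that no other column is needed downstream. The alias A_FLAGS is a name introduced here for illustration. The idea is to drop every unneeded column before the GROUP so the bags stay small, and to use a flat FOREACH with SUM, which is an algebraic function and so should let Pig apply the combiner. Whether this resolves the memory problem in the real script is not established here. Note also that for a group where every flag is null, SUM returns null where COUNT over a filtered bag would return 0.

```pig
-- Assumption: A has (at least) the fields ID, Col1, Col2.
-- Keep only the grouping key and the two 0/1 flags; all other columns are dropped before the GROUP.
A_FLAGS = FOREACH A GENERATE
        ID,
        ((Col1 == 'Other') ? 1 : 0) AS fltrCol1,
        ((Col2 == 'Other') ? 1 : 0) AS fltrCol2;

A_GRP = GROUP A_FLAGS BY ID;

-- Flat FOREACH: each SUM adds up the flags inside the A_FLAGS bag of one group.
A_COUNT = FOREACH A_GRP GENERATE
        group AS ID,
        SUM(A_FLAGS.fltrCol1) AS cnt_fltrCol1,
        SUM(A_FLAGS.fltrCol2) AS cnt_fltrCol2;
```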

At the moment I am running into memory problems (my real script is much larger). Thanks in advance for your answers.

0 answers:

No answers