How to get the MAX from each group in Pig

Time: 2014-11-18 13:58:10

Tags: hadoop apache-pig

1)

  

q3_ReasonYearWise = FOREACH q3_SelectedColumnForReason GENERATE GetYear(application_dt) AS Application_Year, loan_purpose;

2)

  

q3_Group_Reason_Year = GROUP q3_ReasonYearWise BY (Application_Year, loan_purpose);

3)

  

q3_Count_Reasons_Yearwise = FOREACH q3_Group_Reason_Year GENERATE group AS me, COUNT(q3_ReasonYearWise.(Application_Year, loan_purpose)) AS tot;

Everything runs fine up to step 3. After running step 3, my output is:

(2007,car)      5
(2007,house)    1
(2007,other)    53
(2007,moving)   6
(2007,medical)  2

(2008,car)      41
(2008,house)    16
(2008,other)    208
(2008,moving)   20
(2008,medical)  27
(2008,wedding)  44
(2008,vacation) 9

(2009,car)      170
(2009,house)    60
(2009,other)    595
(2009,moving)   58
(2009,medical)  84
(2009,wedding)  132
(2009,vacation) 26

After this, how do I find the max for each year? My output must look like this:

(2007,other)    53
(2008,other)    208
(2009,other)    595

1 Answer:

Answer 0: (score: 0)

Can you try it this way?

  1. Include Application_Year in the third statement as well
  2. Group by application year
  3. Sort each bag by count in descending order
  4. Take the top element
  5. Print the group name and the count

    q3_Count_Reasons_Yearwise = FOREACH q3_Group_Reason_Year GENERATE group.Application_Year AS my_application_year, group AS me, COUNT(q3_ReasonYearWise.(Application_Year, loan_purpose)) AS tot;

    After the third statement, your output should look like this:

    2007    (2007,car)      5
    2007    (2007,house)    1
    
    2008    (2008,car)      41
    2008    (2008,house)    16
    

    After that, do the following:

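    -- Group the per-(year, purpose) counts by year
    A = GROUP q3_Count_Reasons_Yearwise BY my_application_year;
    B = FOREACH A {
            -- Within each year, sort by count, largest first
            sortByMax = ORDER q3_Count_Reasons_Yearwise BY tot DESC;
            -- Keep only the row with the highest count
            topMax = LIMIT sortByMax 1;
            -- $1 is the (year, purpose) tuple "me"; $2 is the count "tot"
            GENERATE FLATTEN(topMax.$1), FLATTEN(topMax.$2);
        }
    DUMP B;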
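
As a variation on the steps above, here is a minimal sketch that flattens the group key so that the year becomes an ordinary top-level field, which avoids carrying the year twice. It assumes q3_Group_Reason_Year exists exactly as built in step 2 of the question. The aliases counts, by_year, sorted_counts, top_row and max_per_year are names chosen for illustration only and are not part of the original post.

```pig
-- Assumption: q3_Group_Reason_Year is the relation from step 2,
-- grouped by (Application_Year, loan_purpose).
counts = FOREACH q3_Group_Reason_Year GENERATE
    FLATTEN(group) AS (Application_Year, loan_purpose),
    COUNT(q3_ReasonYearWise) AS tot;

-- One group per year, each holding that year's (year, purpose, tot) rows.
by_year = GROUP counts BY Application_Year;

max_per_year = FOREACH by_year {
    -- Largest count first within the year
    sorted_counts = ORDER counts BY tot DESC;
    -- Keep a single row; on a tie, which row survives is not guaranteed
    top_row = LIMIT sorted_counts 1;
    GENERATE FLATTEN(top_row);
};

DUMP max_per_year;
```

With the sample counts from the question, this should produce one row per year in the shape (Application_Year, loan_purpose, tot), for example the "other" row with 53 for 2007. If every purpose tied for the maximum should be kept, LIMIT 1 is not enough and a different approach (such as joining against the per-year MAX of tot) would be needed.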