PIG - Get the countries with the highest and lowest medal counts, grouped by year

Time: 2016-06-09 09:09:56

Tags: hadoop apache-pig

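  • I'm new to Pig, and I have a dataset of Olympic data covering 4-5 years. I want to produce the countries that won the most and the fewest medals, split by year. Here is a sample with its header.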

    Athlete,Country,Year,Sport,Gold,Silver,Bronze,Total

    Yang Yilin,China,2008,Gymnastics,1,0,2,3  
    Leisel Jones,Australia,2000,Swimming,0,2,0,2  
    Go Gi-Hyeon,South Korea,2002,Short-Track Speed Skating,1,1,0,2  
    Chen Ruolin,China,2008,Diving,2,0,0,2  
    Katie Ledecky,United States,2012,Swimming,1,0,0,1  
    Ruta Meilutyte,Lithuania,2012,Swimming,1,0,0,1  
    Dániel Gyurta,Hungary,2004,Swimming,0,1,0,1  
    Arianna Fontana,Italy,2006,Short-Track Speed Skating,0,0,1,1  
    Olga Glatskikh,Russia,2004,Rhythmic Gymnastics,1,0,0,1  
    Kharikleia Pantazi,Greece,2000,Rhythmic Gymnastics,0,0,1,1
    
  • I tried what I could with my current knowledge, but with little success. This is what I have so far. Any help solving this would be appreciated!

    DEFINE MYOVER org.apache.pig.piggybank.evaluation.Over;    
    DEFINE MYSTITCH org.apache.pig.piggybank.evaluation.Stitch; 
    
    A = LOAD 'MortDataSite/MyPigExercise/OlympicMedals.csv' using PigStorage(',') as (ATHLETE:CHARARRAY,COUNTRY:CHARARRAY,YEAR:INT,SPORT:CHARARRAY,GOLD:INT,SILVER:INT,BRONZE:INT,TOTAL:INT);
    B = FOREACH A GENERATE YEAR,COUNTRY,TOTAL;  
    C = GROUP B BY (YEAR,COUNTRY);  
    D = FOREACH C GENERATE FLATTEN(group) as (YEAR,COUNTRY), SUM(B.TOTAL) as TOT;
    E = GROUP D BY (YEAR,COUNTRY);  
    F = FOREACH E { 
             E1 = ORDER D BY TOT DESC;
             GENERATE FLATTEN(MYSTITCH(E1, MYOVER(E1,'dense_rank',0,1,1))); 
             }; 
    
    G = FOREACH F GENERATE stitched::YEAR,stitched::COUNTRY ,stitched::TOT,$3;
    

    My output (since many countries have the same TOTAL medal count, I expect more than one country to be able to share a RANK):

    (2000,Cuba,65,1)    
    (2000,Iran,4,1)    
    (2000,Chile,17,1)    
    (2000,China,79,1)    
    (2000,India,7,1)    
    (2000,Italy,65,1)    
    (2000,Japan,42,1)    
    (2000,Kenya,7,1)   
    (2000,Qatar,1,1)   
    (2000,Spain,42,1)   
    (2000,Brazil,48,1)
    

Expected output 1:

YEAR COUNTRY MAX(TOTAL)       
2001 India  50  
2003 UK     90   
2006 Japan  56  

&

Expected output 2:

YEAR COUNTRY MIN(TOTAL)
2001 India  5   
2003 UK     10   
2006 Japan  6

********* Updated query (works as expected) *********

  • Here is the updated query, which gives me the result I wanted.

    DEFINE MYOVER org.apache.pig.piggybank.evaluation.Over;    
    DEFINE MYSTITCH org.apache.pig.piggybank.evaluation.Stitch; 
    
    A = LOAD 'MortDataSite/MyPigExercise/OlympicMedals.csv' using PigStorage(',') as (ATHLETE:CHARARRAY,COUNTRY:CHARARRAY,YEAR:INT,SPORT:CHARARRAY,GOLD:INT,SILVER:INT,BRONZE:INT,TOTAL:INT);
    B = FOREACH A GENERATE YEAR,COUNTRY,TOTAL;  
    C = GROUP B BY (YEAR,COUNTRY);  
    D = FOREACH C GENERATE FLATTEN(group) as (YEAR,COUNTRY), SUM(B.TOTAL) as TOT;
    E = GROUP D BY (YEAR,COUNTRY); 
    F = FOREACH E GENERATE FLATTEN(group) as (YEAR,COUNTRY) ,MAX(D.TOT) as MTOT;
    G = GROUP F BY YEAR;
    H = FOREACH G {
             G1 = ORDER F BY MTOT DESC;
             GENERATE FLATTEN(MYSTITCH(G1, MYOVER(G1,'dense_rank',0,1,1)));
             };
    J = FOREACH H GENERATE stitched::YEAR,stitched::COUNTRY,stitched::MTOT,$3;
    

**Output:**
    YEAR, COUNTRY, MAX(TOTAL), RANK
    (2000,United States,242,1)
    (2000,Russia,187,2)
    (2000,Australia,182,3)
    (2002,United States,84,1)
    (2002,Canada,74,2)
    (2002,Germany,61,3)
    (2004,United States,265,1)
    (2004,Russia,190,2)
    (2004,Australia,156,3)

1 answer:

Answer 0: (score: 0)

If you want the MAX and MIN total medals by country for each year, simply use MAX and MIN.

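-- A is the relation loaded in the question.
B = FOREACH A GENERATE YEAR,COUNTRY,TOTAL;
C = GROUP B BY (YEAR,COUNTRY);
-- Sum each country's medals for each year.
D = FOREACH C GENERATE FLATTEN(group) as (YEAR,COUNTRY), SUM(B.TOTAL) as TOTAL;
E = GROUP D BY (YEAR,COUNTRY);
-- MAX and MIN of the summed totals within each (YEAR,COUNTRY) group.
F = FOREACH E GENERATE FLATTEN(group) as (YEAR,COUNTRY), MAX(D.TOTAL) as MAX_TOTAL;
G = FOREACH E GENERATE FLATTEN(group) as (YEAR,COUNTRY), MIN(D.TOTAL) as MIN_TOTAL;
DUMP F;
DUMP G;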
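
Because D already holds a single row per (YEAR, COUNTRY), grouping it by the same key again leaves one row in every group, so MAX and MIN both return that country's own total. To get one winner and one loser per year, as in the expected outputs of the question, the grouping probably needs to be on YEAR alone. The following is a minimal sketch of that idea, not a tested solution: it assumes the file path and schema from the question, and the aliases BY_YEAR, EXTREMES, HIGHEST, LOWEST, MAX_OUT and MIN_OUT are names made up for illustration. It joins the per-year extremes back to D so that countries tied on the same total are all kept.

```pig
-- Assumes the same CSV path and schema as in the question.
A = LOAD 'MortDataSite/MyPigExercise/OlympicMedals.csv' USING PigStorage(',')
    AS (ATHLETE:chararray, COUNTRY:chararray, YEAR:int, SPORT:chararray,
        GOLD:int, SILVER:int, BRONZE:int, TOTAL:int);

-- One row per (YEAR, COUNTRY) with that country's medal total for the year.
B = FOREACH A GENERATE YEAR, COUNTRY, TOTAL;
C = GROUP B BY (YEAR, COUNTRY);
D = FOREACH C GENERATE FLATTEN(group) AS (YEAR, COUNTRY), SUM(B.TOTAL) AS TOT;

-- Highest and lowest country total within each year (illustrative aliases).
BY_YEAR  = GROUP D BY YEAR;
EXTREMES = FOREACH BY_YEAR GENERATE group AS YEAR,
                                    MAX(D.TOT) AS MAX_TOT,
                                    MIN(D.TOT) AS MIN_TOT;

-- Join back to D so every country matching an extreme is kept, ties included.
J       = JOIN D BY YEAR, EXTREMES BY YEAR;
HIGHEST = FILTER J BY D::TOT == EXTREMES::MAX_TOT;
LOWEST  = FILTER J BY D::TOT == EXTREMES::MIN_TOT;

MAX_OUT = FOREACH HIGHEST GENERATE D::YEAR AS YEAR, D::COUNTRY AS COUNTRY, D::TOT AS MAX_TOTAL;
MIN_OUT = FOREACH LOWEST  GENERATE D::YEAR AS YEAR, D::COUNTRY AS COUNTRY, D::TOT AS MIN_TOTAL;

DUMP MAX_OUT;
DUMP MIN_OUT;
```

The join is used instead of a nested ORDER with LIMIT 1 because LIMIT would keep only one arbitrary country when several share the top or bottom total for a year.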