Get the first and last tuple from a bag using Apache Pig

Asked: 2015-09-18 13:26:33

Tags: apache-pig

I am new to Pig Latin, and I am trying to work with Pig's built-in functions.

A = LOAD 'student.txt' AS (name:chararray, term:chararray, gpa:float);

B = GROUP A BY name;

DUMP B;

(John,{(John,sm,3.8),(John,sp,4.0),(John,wt,3.7),(John,fl ,3.9)})

(Mary,{(Mary,sm,4.0),(Mary,sp,4.0),(Mary,wt,3.9),(Mary,fl,3.8)})

I need to retrieve the first element => (John,sm,3.8) and the last element => (John,fl ,3.9) from the bag.

I need help solving this using a UDF.

1 answer:

Answer 0 (score: 1)

OK, you can use this solution, but it is a bit lengthy.

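-- Load the comma-separated input with an explicit schema.
names = LOAD '/user/user/inputfiles/names.txt' USING PigStorage(',') AS (name:chararray, term:chararray, gpa:float);

-- RANK without a sort field prepends a sequential row number to each tuple.
names_rank = RANK names;

-- Rename the rank column ($0) to row_id so it can be used for ordering.
names_each = FOREACH names_rank GENERATE $0 AS row_id, name, term, gpa;

-- Collect all rows for each student into one bag.
names_grp = GROUP names_each BY name;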

-- First record per name: the tuple with the smallest row_id.
names_first_each = FOREACH names_grp
{
    order_asc = ORDER names_each BY row_id ASC;
    first_rec = LIMIT order_asc 1;

    GENERATE FLATTEN(first_rec) AS (row_id, name, term, gpa);
};

-- Last record per name: the tuple with the largest row_id.
names_last_each = FOREACH names_grp
{
    order_desc = ORDER names_each BY row_id DESC;
    last_rec   = LIMIT order_desc 1;

    GENERATE FLATTEN(last_rec) AS (row_id, name, term, gpa);
};

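-- Combine the first and last records into one relation.
names_unioned = UNION names_first_each, names_last_each;

-- Drop the helper row_id column.
names_extract = FOREACH names_unioned GENERATE name, term, gpa;

-- Sort by name so each student's two rows appear together.
names_ordered = ORDER names_extract BY name;

DUMP names_ordered;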

Output:

(John,fl,3.9)
(John,sm,3.8)
(Mary,fl,3.8)
(Mary,sm,4.0)
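
If one row per name is acceptable, the two nested blocks can probably be folded into a single FOREACH, which avoids the UNION and the final ORDER. The following is only a sketch under that assumption: it reuses the names_grp relation from above, and the alias names_first_last and the output field names (first_term, first_gpa, last_term, last_gpa) are made up for illustration. As in the original, "first" and "last" depend on the row numbers assigned by RANK.

```pig
-- Sketch: first and last record per name in one pass over each group.
-- Assumes names_grp = GROUP names_each BY name, with row_id taken from RANK.
names_first_last = FOREACH names_grp
{
    order_asc  = ORDER names_each BY row_id ASC;
    first_rec  = LIMIT order_asc 1;

    order_desc = ORDER names_each BY row_id DESC;
    last_rec   = LIMIT order_desc 1;

    -- Each bag holds exactly one tuple, so flattening both yields one row per name.
    GENERATE group AS name,
             FLATTEN(first_rec.(term, gpa)) AS (first_term, first_gpa),
             FLATTEN(last_rec.(term, gpa))  AS (last_term, last_gpa);
};

DUMP names_first_last;
```

The trade-off is the output shape: this produces one wide row per student instead of two rows, so it only fits if downstream steps do not need the first and last records as separate tuples.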