I am new to Pig Latin and I am trying to use Pig's built-in functions.
A = LOAD 'student.txt' AS (name:chararray, term:chararray, gpa:float);
B = GROUP A BY name;
DUMP B;
(John,{(John,sm,3.8),(John,sp,4.0),(John,wt,3.7),(John,fl,3.9)})
(Mary,{(Mary,sm,4.0),(Mary,sp,4.0),(Mary,wt,3.9),(Mary,fl,3.8)})
I need to retrieve the first element => (John,sm,3.8)
and the last element => (John,fl,3.9) from each bag.
I need help solving this with a UDF.
Answer 0: (score: 1)
OK, you can use this solution, although it is a bit lengthy.
names = LOAD '/user/user/inputfiles/names.txt' USING PigStorage(',') AS (name:chararray, term:chararray, gpa:float);
-- RANK without a BY clause prepends a sequential row number to each tuple.
names_rank = RANK names;
names_each = FOREACH names_rank GENERATE $0 AS row_id, name, term, gpa;
names_grp = GROUP names_each BY name;
-- First record per name: lowest row_id in each group.
names_first_each = FOREACH names_grp
{
    order_asc = ORDER names_each BY row_id ASC;
    first_rec = LIMIT order_asc 1;
    GENERATE FLATTEN(first_rec) AS (row_id, name, term, gpa);
};
-- Last record per name: highest row_id in each group.
names_last_each = FOREACH names_grp
{
    order_desc = ORDER names_each BY row_id DESC;
    last_rec = LIMIT order_desc 1;
    GENERATE FLATTEN(last_rec) AS (row_id, name, term, gpa);
};
-- Combine both results and drop the helper row_id column.
names_unioned = UNION names_first_each, names_last_each;
names_extract = FOREACH names_unioned GENERATE name, term, gpa;
names_ordered = ORDER names_extract BY name;
DUMP names_ordered;
Output:
(John,fl,3.9)
(John,sm,3.8)
(Mary,fl,3.8)
(Mary,sm,4.0)
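
Since the question asks for a UDF, here is a minimal sketch of one written in Python for Pig's Jython support. It is an illustration under assumptions, not a tested production UDF: the function name first_last and the schema string are made up for this example, and the code assumes Pig makes the outputSchema decorator available when the script is registered USING jython. It also assumes the bag is iterated in load order, which Pig does not guarantee after a GROUP; that is the reason the script above adds an explicit row_id with RANK.

```python
# Sketch of a Pig UDF in Python (Jython). The name first_last and the
# schema string below are hypothetical, chosen only for this example.

try:
    # Assumption: Pig defines this decorator when the file is registered
    # USING jython, so the lookup succeeds inside Pig.
    outputSchema
except NameError:
    # Fallback no-op decorator so the function can be tried in plain Python.
    def outputSchema(schema):
        def wrap(func):
            return func
        return wrap


@outputSchema("ends:bag{t:tuple(name:chararray, term:chararray, gpa:float)}")
def first_last(bag):
    """Return a bag holding the first and last tuples of the input bag.

    Assumption: the bag arrives in the desired order; sort it beforehand
    (for example by a row_id) if the order matters.
    """
    rows = list(bag) if bag is not None else []
    if not rows:
        return []            # empty bag in, empty bag out
    if len(rows) == 1:
        return [rows[0]]     # a single tuple is both first and last
    return [rows[0], rows[-1]]


# Plain-Python check using the data from the question.
john = [("John", "sm", 3.8), ("John", "sp", 4.0),
        ("John", "wt", 3.7), ("John", "fl", 3.9)]
print(first_last(john))
```

Saved as first_last.py (a hypothetical file name), it could be registered with REGISTER 'first_last.py' USING jython AS myudf; and applied to the grouped relation from the question with FOREACH B GENERATE group, FLATTEN(myudf.first_last(A));.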