猪分裂和加入

时间:2012-09-27 16:40:56

标签: apache-pig

我需要将字段值从一行传播到另一行给定类型的记录 例如我的原始输入是

1,firefox,p  
1,,q
1,,r
1,,s
2,ie,p
2,,s
3,chrome,p
3,,r
3,,s
4,netscape,p

期望的结果

1,firefox,p  
1,firefox,q
1,firefox,r
1,firefox,s
2,ie,p
2,ie,s
3,chrome,p
3,chrome,r
3,chrome,s
4,netscape,p

我试过

A = LOAD 'file1.txt' using PigStorage(',') AS (id:int,browser:chararray,type:chararray);
SPLIT A INTO B IF (type =='p'), C IF (type!='p' );
joined =  JOIN B BY id FULL, C BY id;
joinedFields = FOREACH joined GENERATE  B::id,  B::type, B::browser, C::id, C::type;
dump joinedFields;

我得到的结果是

(,,,1,p  )
(,,,1,q)
(,,,1,r)
(,,,1,s)
(2,p,ie,2,s)
(3,p,chrome,3,r)
(3,p,chrome,3,s)
(4,p,netscape,,)

感谢任何帮助,谢谢。

1 个答案:

答案 0 :(得分:2)

PIG不完全是SQL,它是用数据流,MapReduce和组构建的(连接也在那里)。您可以使用嵌套在FOREACH和FLATTEN中的GROUP BY,FILTER获得结果。

inpt = LOAD 'file1.txt' using PigStorage(',') AS (id:int,browser:chararray,type:chararray);
grp = GROUP inpt BY id;
Result = FOREACH grp {
    P = FILTER inpt BY type == 'p'; -- leave the record that contain p for the id
    PL = LIMIT P 1; -- make sure there is just one
    GENERATE FLATTEN(inpt.(id,type)), FLATTEN(PL.browser); -- convert bags produced by group by back to rows
};