I have the following datasets:
A:
x1 y z1
x2 y z2
x3 y z3
x43 y z33
x4 y2 z4
x5 y2 z5
x6 y2 z6
x7 y2 z7
B:
y 12
y2 25
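Loading A: LOAD '$input' USING PigStorage() AS (k:chararray, m:chararray, n:chararray);
Loading B: LOAD '$input2' USING PigStorage() AS (o:chararray, p:int);
I am joining A on m with B on o. What I want is to select only x tuples for each o. So, for example, if x is 2, the result would be: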
x1 y z1
x2 y z2
x4 y2 z4
x5 y2 z5
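Answer 0: (score: 1)
To do this you need GROUP BY, then FOREACH with a nested LIMIT, followed by a JOIN or COGROUP. Here is an implementation in Pig 0.10; I used your input data and got the specified output: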
A = load '~/pig/data/subset_join_A.dat' as (k:chararray, m:chararray, n:chararray);
B = load '~/pig/data/subset_join_B.dat' as (o:chararray, p:int);
-- the join will be on m, so keep only 2 rows for each value of m
group_A = group A by m;
top_A_x = foreach group_A {
    top = limit A 2; -- where x = 2
    generate flatten(top);
};
-- cogroup is another way to join; it also allows left or right joins and emptiness checks
co_join = cogroup top_A_x by (m), B by (o);
-- drop the groups whose key from A has no matching record in B
filter_join = filter co_join by IsEmpty(B) == false;
result = foreach filter_join generate flatten(top_A_x);
Alternatively, you can do it with COGROUP alone, using FOREACH with a nested LIMIT:
A = load '~/pig/data/subset_join_A.dat' as (k:chararray, m:chararray, n:chararray);
B = load '~/pig/data/subset_join_B.dat' as (o:chararray, p:int);
co_join = cogroup A by (m), B by (o);
filter_join = filter co_join by IsEmpty(B) == false;
result = foreach filter_join {
    top = limit A 2;
    -- you can limit B as well
    generate flatten(top);
};
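One caveat: a nested LIMIT without an ORDER BY does not promise which tuples of each group are kept. If specific rows are wanted, one option is to sort inside the nested block first. The following is only a sketch of the COGROUP variant under two assumptions that are not in the question: that "first" means smallest k, and that x is passed as a script parameter named x (the script name script.pig is hypothetical).

```pig
-- Sketch only: ordering by k and the parameter name x are assumptions.
-- Run with, for example: pig -param x=2 script.pig
A = load '~/pig/data/subset_join_A.dat' as (k:chararray, m:chararray, n:chararray);
B = load '~/pig/data/subset_join_B.dat' as (o:chararray, p:int);
co_join = cogroup A by (m), B by (o);
-- keep only the keys that also appear in B
filter_join = filter co_join by IsEmpty(B) == false;
result = foreach filter_join {
    sorted = order A by k;   -- k is a chararray, so this is a string sort
    top = limit sorted $x;   -- $x is replaced textually before the script runs
    generate flatten(top);
};
```

Because k is a chararray, the sort is lexical rather than numeric, so a value such as x10 would sort before x2.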