I am using Pig to build groups from tuples, like this:
a1, b1
a1, b2
a1, b3
...
->
a1, [b1, b2, b3]
...
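That part is easy and works. My problem is the next step: from each resulting group, I want to generate every pair of tuples in the group's bag: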
a1, [b1, b2, b3]
->
b1,b2
b1,b3
b2,b3
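This would be easy if I could nest "foreach" statements, iterating first over each group and then over its bag.
I think I have misunderstood the concept, and I would really appreciate an explanation.
Thanks.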
Answer 0 (score: 15)
It looks like you need the Cartesian product of the bag with itself. To get it, use FLATTEN(bag) twice in the same GENERATE.
Code:
inpt = load '.../group.txt' using PigStorage(',') as (id, val);
grp = group inpt by (id);
id_grp = foreach grp generate group as id, inpt.val as value_bag;
result = foreach id_grp generate id, FLATTEN(value_bag) as v1, FLATTEN(value_bag) as v2;
dump result;
Note that large bags produce a lot of rows: a bag with n elements yields n^2 rows. To avoid that, you can apply TOP(...) before FLATTEN:
inpt = load '....group.txt' using PigStorage(',') as (id, val);
grp = group inpt by (id);
id_grp = foreach grp generate group as id, inpt.val as values;
result = foreach id_grp {
    limited_bag = TOP(50, 0, values); -- all sorts of filtering could be done here
    generate id, FLATTEN(limited_bag) as v1, FLATTEN(limited_bag) as v2;
};
dump result;
For your specific output, you can do some filtering before FLATTEN:
inpt = load '..../group.txt' using PigStorage(',') as (id, val);
grp = group inpt by (id);
id_grp = foreach grp generate group as id, inpt.val as values;
result = foreach id_grp {
    l = filter values by val == 'b1' or val == 'b2';
    generate id, FLATTEN(l) as v1, FLATTEN(values) as v2;
};
result = filter result by v1 != v2;
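A filter on v1 != v2 removes self-pairs but still keeps both orderings of a pair, such as (b1,b2) and (b2,b1). A more general variant that does not hard-code the values could compare the two flattened fields with <, so that each unordered pair survives exactly once. This is only a sketch: it assumes a comma-separated input with two string columns, it assumes the values within a group are distinct, and the aliases crossed and uniq_pairs are made up for illustration.

```pig
-- Assumption: comma-separated input, both columns are strings.
-- The path is left elided, as in the snippets above.
inpt = load '.../group.txt' using PigStorage(',') as (id:chararray, val:chararray);
grp = group inpt by id;

-- Two FLATTENs in one GENERATE give the cross product of the bag with itself.
crossed = foreach grp generate group as id, FLATTEN(inpt.val) as v1, FLATTEN(inpt.val) as v2;

-- Keeping only v1 < v2 drops self-pairs and the mirrored duplicates.
uniq_pairs = filter crossed by v1 < v2;
dump uniq_pairs;
```

The cross product is still materialized before the filter runs, so the warning about large bags applies here as well.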
I hope that helps.
Cheers
Answer 1 (score: 4)
The UnorderedPairs function in the DataFu UDF library is also relevant. It generates all pairs of items in a bag (in your case, your grouped bag).
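As a rough sketch of how that might look for the grouped data in the question: the jar path below is a placeholder to replace with your own DataFu build, and the input schema is an assumption carried over from the question.

```pig
-- Placeholder jar path (assumption); point it at the DataFu jar you actually have.
register 'datafu.jar';
define UnorderedPairs datafu.pig.bags.UnorderedPairs();

-- Assumption: comma-separated input with two string columns; path elided as in the question.
inpt = load '.../group.txt' using PigStorage(',') as (id:chararray, val:chararray);
grp = group inpt by id;

-- UnorderedPairs takes a bag and returns a bag of pairs of its tuples;
-- FLATTEN turns each pair into its own output row.
pairs = foreach grp generate group as id, FLATTEN(UnorderedPairs(inpt.val));
dump pairs;
```

Each element of a pair is itself a tuple taken from the original bag, so a further projection may be needed if you want bare values.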
Answer 2 (score: 1)
You can use GROUP ALL.
Pig statements to generate this:
A = -- Some bag
B = -- Another bag
groupedB = group B ALL;
result = foreach A GENERATE
TOTUPLE(*), groupedB.$1;
-- Will generate
((a1), {(b1, b2, b3)})
((a2), {(b1, b2, b3)})
((a3), {(b1, b2, b3)})
...