Question

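I have two bags or relations with the same fields, say B1 and B2. I want to subtract B2 from B1 so that I get all the tuples that are in B1 but not in B2. Pig's SUBTRACT function subtracts fields, but I am looking for tuple subtraction, like the "difference" or "minus" operators in SQL.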

Example: bag/relation R1 has the following tuples:

(a1, b1)
(a2, b2)
(a3, b3)
(a4, b4)

Relation R2 has the following tuples:

(a1, b1)
(a2, b2, d2)
(a3, b3, d3)
(a4, b4)

I want to obtain a relation/bag containing the following:

(a1, b1)
(a4, b4)

Answer 1

Here is how I solved this problem:

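-- Load both relations; table2 carries an extra third field.
table1 = LOAD './subtract1.dat' USING PigStorage(',') AS (c1, c2);
table2 = LOAD './subtract2.dat' USING PigStorage(',') AS (d1, d2, d3);

-- Group both relations on the fields being compared.
cgrp = COGROUP table1 BY (c1, c2), table2 BY (d1, d2);

-- Keep only the groups that have no matching tuple in table2.
subtract = FILTER cgrp BY IsEmpty(table2);
subtract_flatten = FOREACH subtract GENERATE FLATTEN(table1);

DUMP subtract_flatten;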

I got the idea from http://agiletesting.blogspot.in/2012/02/set-operations-in-apache-pig.html

Answer 2

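Try this: you can use Pig's JOIN operation to implement difference, minus, and similar operations. See the following link for a difference operation built on a left outer join: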

http://blog.matthewrathbone.com/2013/04/07/real-world-hadoop---implementing-a-left-outer-join-in-pig.html
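
As a rough illustration of the join-based idea, the sketch below keeps the tuples of one relation that have no partner in the other. It is not taken from the linked article. The file names, the aliases (table1, table2, joined, only_in_table1, result) and the field names are borrowed from Answer 1 or made up for illustration, and it assumes the comparison should use only the first two fields.

```pig
-- Assumed inputs: the same comma-separated files and schemas as in Answer 1.
table1 = LOAD './subtract1.dat' USING PigStorage(',') AS (c1, c2);
table2 = LOAD './subtract2.dat' USING PigStorage(',') AS (d1, d2, d3);

-- A left outer join keeps every table1 tuple; when no table2 tuple
-- matches, the table2 fields in the joined tuple are null.
joined = JOIN table1 BY (c1, c2) LEFT OUTER, table2 BY (d1, d2);

-- Null join keys never match, so a null d1 marks a table1 tuple
-- that found no partner in table2.
only_in_table1 = FILTER joined BY table2::d1 IS NULL;

-- Project back to the original table1 fields.
result = FOREACH only_in_table1 GENERATE table1::c1 AS c1, table1::c2 AS c2;

DUMP result;
```

Unlike SQL's MINUS, this keeps duplicate tuples from table1; if set semantics are wanted, a DISTINCT on result would remove them.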

Subtracting two bags/relations in Pig
