Question

Suppose I have a table like the one below, which may or may not contain duplicates for a given field:

ID     URL
---    ------------------
001    http://example.com/adam
002    http://example.com/beth
002    http://example.com/beth?extra=blah
003    http://example.com/charlie

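I would like to write a Pig script that finds DISTINCT rows based on the value of a single field only. For example, filtering the table above by ID should return something like this: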

ID     URL
---    ------------------
001    http://example.com/adam
002    http://example.com/beth
003    http://example.com/charlie

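The Pig GROUP BY operator returns a bag of tuples grouped by ID, which would work if I knew how to get just the first tuple of each bag (possibly a separate question).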

The Pig DISTINCT operator works on entire rows, so in this case all four rows would be considered unique, which is not what I want.
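A minimal sketch of that behaviour, assuming the table is stored in a tab-delimited file (the path `my_table_file` and the aliases `all_distinct`, `ids`, and `distinct_ids` are hypothetical names chosen for illustration):

```pig
my_table = LOAD 'my_table_file' AS (ID, URL);

-- DISTINCT compares whole tuples, so both rows for ID 002 survive
-- because their URL fields differ.
all_distinct = DISTINCT my_table;

-- Projecting down to ID first does remove the duplicate,
-- but the URL column is lost in the process.
ids = FOREACH my_table GENERATE ID;
distinct_ids = DISTINCT ids;

DUMP distinct_ids;
```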

For my purposes, I do not care which of the rows with ID 002 is returned.

Answer 1

I found one way to do this, using the GROUP operator and the built-in TOP function:

my_table = LOAD 'my_table_file' AS (A, B);

my_table_grouped = GROUP my_table BY A;

my_table_distinct = FOREACH my_table_grouped {

    -- For each group $0 refers to the group name, (A)
    -- and $1 refers to a bag of entire rows {(A, B), (A, B), ...}.
    -- Here, we take only the first (top 1) row in the bag:

    result = TOP(1, 0, $1);
    GENERATE FLATTEN(result);

}

DUMP my_table_distinct;

This yields one row per distinct ID:

(001,http://example.com/adam)
(002,http://example.com/beth?extra=blah)
(003,http://example.com/charlie)

I don't know whether there is a better approach, but this works for me. I hope it helps others who are getting started with Pig.

(Reference: http://pig.apache.org/docs/r0.12.1/func.html#topx)

Answer 2

I have found that you can do this with a nested grouping and LIMIT. So, using Arel's example:

my_table = LOAD 'my_table_file' AS (A, B);

-- Nested foreach grouping generates bags with same A,
-- limit bags to 1

my_table_distinct = FOREACH (GROUP my_table BY A) {
  result = LIMIT my_table 1;
  GENERATE FLATTEN(result);
}

DUMP my_table_distinct;
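Neither TOP over the grouping column nor a bare LIMIT guarantees which row of a group is kept. If a repeatable choice is wanted, one option (a sketch, not part of the answers above) is to sort each bag inside the nested FOREACH before limiting it. The aliases `by_url` and `first_row`, the chararray types, and the choice to order by B ascending are assumptions made for illustration:

```pig
my_table = LOAD 'my_table_file' AS (A:chararray, B:chararray);

my_table_distinct = FOREACH (GROUP my_table BY A) {
  -- Sort the rows of this group by B so the pick is repeatable
  by_url = ORDER my_table BY B ASC;
  -- Keep only the first row after sorting
  first_row = LIMIT by_url 1;
  GENERATE FLATTEN(first_row);
}

DUMP my_table_distinct;
```

With the sample data, this should keep http://example.com/beth for ID 002, since that URL sorts before the variant ending in ?extra=blah.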

Answer 3

You can use FirstTupleFromBag from Apache DataFu™ (incubating):

REGISTER datafu-pig-incubating-1.3.1.jar;
DEFINE FirstTupleFromBag datafu.pig.bags.FirstTupleFromBag();

my_table_grouped = GROUP my_table BY A;
-- Take the first tuple of each group's bag; null is the default returned for an empty bag.
my_table_grouped_first_tuple = FOREACH my_table_grouped GENERATE FLATTEN(FirstTupleFromBag(my_table, null));

Selecting DISTINCT rows based on a single column in Apache Pig
