How do I find the distinct values in a DataBag?

Asked: 2014-03-13 00:20:59

Tags: apache-pig

Say I have some data like this:

1,A
1,A
1,B
2,C
2,D
3,E
3,E

I'd like to group on the first column and then return the distinct values in each group, in either of these forms:

1,A,B
2,C,D
3,E

1,{A,B}
2,{C,D}
3,{E}

Is there a way to do this without writing a UDF?

If I do this:

DATA = LOAD 'data.txt' USING PigStorage(',') AS (a:int, b:chararray);

GROUPED = GROUP DATA BY a;

UNIQUES = FOREACH GROUPED {
    distinct_bs = DISTINCT GROUPED.b;
    GENERATE
        group AS a
        ,FLATTEN(distinct_bs)
    ;
}

(with or without the FLATTEN, and whether or not I include group AS a), I get:

ERROR 1200: org.apache.pig.newplan.logical.expression.ScalarExpression
cannot be cast to org.apache.pig.newplan.logical.expression.ProjectExpression

1 answer:

Answer 0: (score: 0)

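GROUPED has no field named b. After the GROUP, b lives inside the nested bag named DATA, so GROUPED.b is most likely being read as a scalar reference to the relation rather than a projection, which would explain the ScalarExpression error: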

DESCRIBE GROUPED;
GROUPED: {group: int,DATA: {(a: int,b: chararray)}}

Try the following, which projects b from the DATA bag instead:

UNIQUES = FOREACH GROUPED {
    distinct_bs = DISTINCT DATA.b;
    GENERATE
        group AS a,
        distinct_bs;
}

Result:

(1,{(A),(B)})
(2,{(C),(D)})
(3,{(E)})
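
The result above matches the second form in the question (a bag per key). For the first, flat form (1,A,B), one possible sketch is to join the distinct bag into a single delimited string. This assumes your Pig release ships the BagToString built-in; the alias UNIQUES_FLAT and the field name bs are illustrative names, not part of the original post.

```pig
-- Sketch only: assumes the BagToString built-in is available in your Pig release.
DATA = LOAD 'data.txt' USING PigStorage(',') AS (a:int, b:chararray);

GROUPED = GROUP DATA BY a;

-- UNIQUES_FLAT and bs are hypothetical names chosen for this example.
UNIQUES_FLAT = FOREACH GROUPED {
    -- Project b from the nested bag DATA, then de-duplicate within the group.
    distinct_bs = DISTINCT DATA.b;
    GENERATE
        group AS a,
        -- Join the distinct values into one chararray, separated by commas.
        BagToString(distinct_bs, ',') AS bs;
}

DUMP UNIQUES_FLAT;
```

Each output row should then hold the group key plus one comma-joined string. Bags are unordered, so the order of values inside that string is not guaranteed. Note also that bs is a single chararray rather than separate columns.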