I have a use case where I need to count the distinct values of a pair of fields.
Example:
x = LOAD 'testdata' using PigStorage('^A') as (a,b,c,d);
y = GROUP x BY a;
z = FOREACH y {
bc = DISTINCT x.b,x.c;  -- the problematic line
dd = DISTINCT x.d;
GENERATE FLATTEN(group) as (a), COUNT(bc), COUNT(dd);
};
Answer 0 (score: 9)
You're close. The key is not to apply DISTINCT to two fields, but to apply it to a single composite field that you create:
x = LOAD 'testdata' using PigStorage('^A') as (a,b,c,d);
x2 = FOREACH x GENERATE a, TOTUPLE(b,c) AS bc, d;
y = GROUP x2 BY a;
z = FOREACH y {
bc = DISTINCT x2.bc;
dd = DISTINCT x2.d;
GENERATE FLATTEN(group) AS (a), COUNT(bc), COUNT(dd);
};
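If a separate pass to build the tuple column feels heavy, projecting both fields inside the nested block may be enough. This is only a sketch: it assumes a Pig version whose nested FOREACH accepts the multi-field projection x.(b, c), and the aliases bc_pairs, bc_count and d_count are names chosen here for illustration, not taken from the question.

```pig
x = LOAD 'testdata' USING PigStorage('^A') AS (a, b, c, d);
y = GROUP x BY a;
z = FOREACH y {
    -- Project the two fields together, giving a bag of (b, c) tuples per group.
    bc_pairs = x.(b, c);
    bc = DISTINCT bc_pairs;
    dd = DISTINCT x.d;
    GENERATE group AS a, COUNT(bc) AS bc_count, COUNT(dd) AS d_count;
};
```

Note that COUNT skips tuples whose first field is null, so pairs with a null b are not counted here; COUNT_STAR would include them.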
Answer 1 (score: 0)
IMHO there is no easy way to do this (nothing like COUNT(DISTINCT a) in MySQL), so you need to split your table, once for each of the two counts.
x = LOAD 'testdata' using PigStorage('^A') as (a,b,c,d);
w1 = FOREACH x GENERATE a, CONCAT(b,c) AS bc;
w2 = FOREACH x GENERATE a, d;
v1 = DISTINCT w1;
v2 = DISTINCT w2;
u1 = GROUP v1 BY a;
u2 = GROUP v2 BY a;
t1 = FOREACH u1 GENERATE group AS a, COUNT(v1.bc);
t2 = FOREACH u2 GENERATE group AS a, COUNT(v2.d);
s = JOIN t1 BY a, t2 BY a;
A UDF could simplify this considerably.
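One caveat with CONCAT(b,c): different pairs can concatenate to the same value (for example 'ab','c' and 'a','bc' both give 'abc'), so the distinct count can come out too low. A possible variant, sketched below, keeps b and c as separate columns and counts the distinct rows instead. It also names the counts and drops the duplicated key after the join. The aliases r, bc_count and d_count are illustrative, not from the original answer.

```pig
x  = LOAD 'testdata' USING PigStorage('^A') AS (a, b, c, d);
w1 = FOREACH x GENERATE a, b, c;   -- keep b and c separate, no CONCAT
w2 = FOREACH x GENERATE a, d;
v1 = DISTINCT w1;                  -- one row per distinct (a, b, c)
v2 = DISTINCT w2;                  -- one row per distinct (a, d)
u1 = GROUP v1 BY a;
u2 = GROUP v2 BY a;
t1 = FOREACH u1 GENERATE group AS a, COUNT(v1) AS bc_count;
t2 = FOREACH u2 GENERATE group AS a, COUNT(v2.d) AS d_count;
s  = JOIN t1 BY a, t2 BY a;
-- After the join both inputs carry a column named a, so disambiguate with ::
r  = FOREACH s GENERATE t1::a AS a, t1::bc_count AS bc_count, t2::d_count AS d_count;
```

Because every value of a appears in both t1 and t2, the inner join should not drop any groups, apart from a null key, which an inner join never matches.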