I have a dataset (userid, resH, resW) with data such as
(1001, 800, 600)
(1001, 800, 600)
(1002, 900, 700)
(1003, 900, 700)
(1004, 1800, 600)
(1005, 1800, 1600)
I want to get the number of distinct users for each (resH, resW) group.
For example, with the data above the output would be
800, 600, 1
900, 700, 2
1800, 600, 1
1800, 1600, 1
I tried something like
D = group data by (resH,resW);
E = foreach D {
unique = DISTINCT data.userId;
generate group, COUNT(unique) as unique_cnt;
};
But I am not getting what I expect.
Answer 0: (score: 1)
Load the data, apply DISTINCT to remove duplicate rows, then group by the two columns of interest and count the user IDs.
A = LOAD 'data.csv' USING PigStorage(',') AS (userid:int,resH:int,resW:int);
B = DISTINCT A;
C = GROUP B BY (resH,resW);
D = FOREACH C GENERATE FLATTEN(group) AS (resH,resW),COUNT(B.userid);
DUMP D;
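The full-row DISTINCT above counts distinct users only because each row consists of exactly (userid, resH, resW); with additional columns, the same user could be counted more than once per group. As an alternative, here is a minimal sketch that applies DISTINCT to the user IDs inside each group using a nested FOREACH. It assumes the same 'data.csv' file and schema as above; the alias names by_res, per_res, users and uniq are illustrative only.

```pig
-- Load the raw rows; duplicates are left in place.
A = LOAD 'data.csv' USING PigStorage(',') AS (userid:int, resH:int, resW:int);

-- One group per (resH, resW) pair; each group holds a bag named A.
by_res = GROUP A BY (resH, resW);

-- Inside each group, project the user IDs, de-duplicate them, then count.
per_res = FOREACH by_res {
    users = A.userid;
    uniq  = DISTINCT users;
    GENERATE FLATTEN(group) AS (resH, resW), COUNT(uniq) AS unique_cnt;
};

DUMP per_res;
```

Note that the field name must match the declared schema exactly (userid, not userId), and that DUMP does not guarantee the order of the output rows.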
Answer 1: (score: 0)
One way to get the required output:
Alias1 = LOAD 'input.txt' USING PigStorage(',') AS (userid:int,resH:int,resW:int);
Alias2 = DISTINCT Alias1;
Alias3 = GROUP Alias2 BY (resH,resW);
Alias4 = FOREACH Alias3 GENERATE FLATTEN(group), COUNT(Alias2.userid);
DUMP Alias4;
or
STORE Alias4 INTO 'output.txt' USING PigStorage(',');