So, I have the following data in HDFS:
user_id, category_id
1, 12344
1, 12344
1, 12345
2, 12345
2, 12345
3, 12344
3, 12344
and so on.. I want to find the number of repeated users for each category.
So, in the above example..
12344, 2 (because user_id 1 and 3 are repeated users)
12345, 1 (user_id 2 is a repeated user; user_id 1 is not, as that user visited just once)
How can I do this in Pig?
Answer (score: 1):
First try to keep only the repeated users, then apply a grouping and count them to get the final result. Please try the code below.
Input:
1,12344
1,12344
1,12345
2,12345
2,12345
3,12344
3,12344
Pig Script:
-- load the (user_id, category_id) pairs
records = LOAD '/home/inputfiles/repeats.txt' USING PigStorage(',') AS (id:int, category:int);
-- group by (user, category) and flag pairs that occur more than once
records_grp = GROUP records BY (id, category);
records_each = FOREACH records_grp GENERATE FLATTEN(group) AS (id, category), (COUNT(records.id) > 1 ? 'Y' : 'N') AS repeat_ind;
-- keep only the repeated users, then count them per category
records_filter = FILTER records_each BY repeat_ind == 'Y';
rec_grp = GROUP records_filter BY category;
rec_each = FOREACH rec_grp GENERATE group AS category, COUNT(records_filter) AS cnt_of_repeated_users;
DUMP rec_each;
Output:
(12344,2)
(12345,1)
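
For what it's worth, the same result can be obtained without the Y/N indicator column by keeping the raw visit count per (user, category) pair and filtering on it. A minimal sketch along those lines, assuming the same input path and schema as above (relation names here are just illustrative):

-- count how many times each (user, category) pair occurs
records = LOAD '/home/inputfiles/repeats.txt' USING PigStorage(',') AS (id:int, category:int);
by_pair = GROUP records BY (id, category);
visits  = FOREACH by_pair GENERATE FLATTEN(group) AS (id, category), COUNT(records) AS cnt;
-- a user is "repeated" in a category if the pair occurs more than once
repeats = FILTER visits BY cnt > 1;
-- count the repeated users per category
per_cat = GROUP repeats BY category;
result  = FOREACH per_cat GENERATE group AS category, COUNT(repeats) AS cnt_of_repeated_users;
DUMP result;

On the sample input this should likewise give (12344,2) and (12345,1), since each repeated user contributes exactly one row after the filter.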