How to find the number of repeat users in Pig

Time: 2016-01-26 06:15:06

Tags: apache-pig

I have the following data in HDFS.

user_id, category_id
1, 12344
1, 12344
1, 12345
2, 12345
2, 12345
3, 12344
3, 12344

And so on. I want to find the number of repeat users for each category.

So, for the example above, the result should be:

12344, 2 (user_ids 1 and 3 are repeat users)
12345, 1 (user_id 2 is a repeat user; user_id 1 is not, because that user visited this category only once)

How can I do this in Pig?

1 answer:

Answer 0: (score: 1)

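First keep only the repeated (user, category) pairs, then group them by category and count them to get the final result. Try the following code.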

Input:

1,12344
1,12344
1,12345
2,12345
2,12345
3,12344
3,12344

Pig Script:

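-- Load the comma-separated (id, category) pairs.
records = LOAD '/home/inputfiles/repeats.txt' USING PigStorage(',') AS (id:int, category:int);

-- Group identical (id, category) pairs so repeats can be counted.
records_grp = GROUP records BY (id, category);

-- Flag each pair 'Y' if it occurs more than once, otherwise 'N'.
records_each = FOREACH records_grp GENERATE FLATTEN(group) AS (id, category), (COUNT(records.id) > 1 ? 'Y' : 'N') AS repeat_ind;

-- Keep only the repeated users.
records_filter = FILTER records_each BY repeat_ind == 'Y';

-- Count the repeated users in each category.
rec_grp = GROUP records_filter BY category;

rec_each = FOREACH rec_grp GENERATE group AS category, COUNT(records_filter) AS cnt_of_repeated_users;

DUMP rec_each;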

Output:

(12344,2)
(12345,1)
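
The 'Y'/'N' indicator is not strictly needed: the per-pair count can be filtered directly. The following is only a sketch of that variant, assuming the same input file and schema as above; the relation and field names (by_user_cat, pair_counts, visits, repeats, by_cat, result) are made up for illustration. On the sample input it should yield the same two tuples.

```pig
-- Assumes the same comma-separated file and (id, category) schema as above.
records = LOAD '/home/inputfiles/repeats.txt' USING PigStorage(',') AS (id:int, category:int);

-- Count how many times each (id, category) pair occurs.
by_user_cat = GROUP records BY (id, category);
pair_counts = FOREACH by_user_cat GENERATE FLATTEN(group) AS (id, category), COUNT(records) AS visits;

-- A user is a repeat user of a category when the pair occurs more than once.
repeats = FILTER pair_counts BY visits > 1;

-- Count the repeat users per category.
by_cat = GROUP repeats BY category;
result = FOREACH by_cat GENERATE group AS category, COUNT(repeats) AS cnt_of_repeated_users;

DUMP result;
```

Note that COUNT skips tuples whose first field is null, so rows with an unparseable id would not be counted in either version.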