Pig: using conditions

Time: 2017-05-17 06:03:59

Tags: apache-pig

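I have the following datasets for a movie database:

Ratings: UserID, MovieID, Rating
Movies: MovieID, Title
Users: UserID, Gender, Age

I have joined the ratings and the users. The goal is to determine the ratings for each MovieID broken down by gender (F and M), and to include only the movies for which both F and M have at least 20 ratings.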

data = JOIN myuser BY user, myrating BY user;
grouped_users = GROUP data BY (movie,gender);

After grouping, I need to filter out the movies for which either gender has fewer than 20 ratings. How can I do that?

grouped_users_twenty = FILTER grouped_users BY SIZE(grouped_users)>=20;

This is my logic, but it produces an error.

2 answers:

Answer 0 (score: 0)

data = JOIN myuser BY user, myrating BY user;
grouped_users = foreach (GROUP data BY (movie,gender)) {
    generate
        group.movie,
        group.gender,
        SIZE(data) as user_size
    ;
};

grouped_users_twenty = FILTER grouped_users BY user_size>=20;
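
Note that a filter on the per-group count keeps individual (movie, gender) rows, so a movie can still come through with only one gender. Below is a minimal sketch of one way to require both genders. The file names, the comma delimiter, the field types, and the aliases counts, enough, by_movie, both_genders and qualified are assumptions made for illustration and are not from the question. It also assumes gender only takes the values F and M.

```pig
-- Hypothetical inputs: paths, delimiter and types are assumptions.
myuser   = LOAD 'users.csv'   USING PigStorage(',') AS (user:int, gender:chararray, age:int);
myrating = LOAD 'ratings.csv' USING PigStorage(',') AS (user:int, movie:int, rating:int);

data = JOIN myuser BY user, myrating BY user;

-- One row per (movie, gender) with the number of ratings in that group.
counts = FOREACH (GROUP data BY (movie, gender)) GENERATE
    group.movie  AS movie,
    group.gender AS gender,
    COUNT(data)  AS rating_count;

-- Keep only the (movie, gender) pairs that have at least 20 ratings.
enough = FILTER counts BY rating_count >= 20;

-- A movie qualifies only if both its F row and its M row survived the filter.
by_movie     = GROUP enough BY movie;
both_genders = FILTER by_movie BY SIZE(enough) == 2;
qualified    = FOREACH both_genders GENERATE FLATTEN(enough);

DUMP qualified;
```

Under these assumptions, qualified holds two rows per remaining movie, one for each gender, each with its rating count.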

Answer 1 (score: 0)

You have to use COUNT instead of SIZE.

grouped_users_twenty = FOREACH grouped_users GENERATE group, COUNT(data) AS rating_count;
final = FILTER grouped_users_twenty BY rating_count >= 20;
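
For context, the FILTER in the question most likely fails because SIZE(grouped_users) names the relation itself rather than a field inside it. After GROUP data BY (...), each row has two fields: group, and a bag named after the grouped relation, which here is data. As far as I know, both COUNT and SIZE accept a bag. COUNT skips tuples whose first field is null, while SIZE counts every tuple. The sketch below applies the filter directly to the bag. The file names, delimiter and field types are assumptions, not from the question.

```pig
-- Hypothetical inputs: paths, delimiter and types are assumptions.
myuser   = LOAD 'users.csv'   USING PigStorage(',') AS (user:int, gender:chararray, age:int);
myrating = LOAD 'ratings.csv' USING PigStorage(',') AS (user:int, movie:int, rating:int);

data = JOIN myuser BY user, myrating BY user;
grouped_users = GROUP data BY (movie, gender);

-- Prints the schema: a `group` tuple plus a bag named `data`.
DESCRIBE grouped_users;

-- Filter on the bag `data`, not on the relation alias `grouped_users`.
grouped_users_twenty = FILTER grouped_users BY COUNT(data) >= 20;
```

Running DESCRIBE before the FILTER is a quick way to confirm which field names are available at that step.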