Analyzing a movie dataset with Pig

Time: 2017-05-16 13:14:07

Tags: hadoop apache-pig

I have the following datasets for a movie database:

Ratings: UserID, MovieID, Rating
Movies: MovieID, Title
Users: UserID, Gender, Age

Now I have to join these three datasets and find which movie is rated highest among women and lowest among men, and vice versa. I did the JOIN:

myusers = LOAD '/user/cloudera/movies/input/users.dat' 
  USING PigStorage(':') 
  AS (user:int, n1, gender:chararray, n2, age:int);

ratings = LOAD '/user/cloudera/movies/input/ratings.dat' 
  USING PigStorage(':') 
  AS (user:int, n1, movie:int, n2, rating:int);

movies = LOAD '/user/cloudera/movies/input/movies.dat' 
  USING PigStorage(':') 
  AS (movie:int,n1,title:chararray);

data = JOIN ratings BY user, myusers BY user;
data2= JOIN data BY ratings::movie, movies BY movie;

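After this I ran into a number of problems, for example "ERROR 0: Scalar has more than one row in the output" when I try to print columns from data2. Any ideas that could help me complete this task?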

1 Answer:

Answer 0: (score: 0)

Follow these steps:

data = JOIN ratings BY user, myusers BY user;
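
On the error from the question: "Scalar has more than one row in the output" usually appears when a relation name is used with dot notation inside FOREACH (for example movies.title). Pig treats that as a scalar lookup into the whole relation, not as a column of the joined data. A minimal sketch, assuming that is what triggered it here: after a JOIN, refer to a column by its bare name when it is unique, and use the :: prefix only where both inputs share a name. The alias cols is made up for illustration.

```pig
-- Assumes ratings, myusers and movies are loaded as in the question.
data  = JOIN ratings BY user, myusers BY user;
data2 = JOIN data BY ratings::movie, movies BY movie;

-- 'movie' exists on both sides of the second join, so it needs a prefix;
-- title, gender and rating are unique and can be used as-is.
cols = FOREACH data2 GENERATE movies::movie AS movie, title, gender, rating;

DESCRIBE cols;  -- prints the schema of the projected relation
```

Running DESCRIBE data2 first is a quick way to see the exact prefixed names Pig assigned after each join.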

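Create two datasets using gender as the filter, one for male users and one for female users. Then sort each dataset by rating and take its maximum and minimum.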

-- 'M' and 'F' are assumed gender codes; use the values found in users.dat.
male = FILTER data BY gender == 'M';
female = FILTER data BY gender == 'F';
-- Highest and lowest single rating row for each gender.
m_max = LIMIT (ORDER male BY rating DESC) 1;
f_max = LIMIT (ORDER female BY rating DESC) 1;
m_min = LIMIT (ORDER male BY rating ASC) 1;
f_min = LIMIT (ORDER female BY rating ASC) 1;
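
The ORDER/LIMIT statements above return one rating row per gender, and data does not contain the movie title. If "highest rated" is taken to mean the highest average rating per movie, which is an assumption about what the question intends, the ratings can be grouped by movie before sorting. The sketch below covers the female side only. The aliases data2, rated, f_rows, f_grp, f_avg, f_sorted and f_best are illustrative, and movies is assumed to be loaded as in the question.

```pig
-- Join in the titles, then keep only the columns needed.
data2 = JOIN data BY ratings::movie, movies BY movie;
rated = FOREACH data2 GENERATE title, gender, rating;

-- Average rating per movie among female users.
-- Grouping by title alone merges any movies that share a title.
f_rows = FILTER rated BY gender == 'F';
f_grp  = GROUP f_rows BY title;
f_avg  = FOREACH f_grp GENERATE group AS title,
                                AVG(f_rows.rating) AS avg_rating;

-- Highest average among women; sort ASC instead for the lowest.
f_sorted = ORDER f_avg BY avg_rating DESC;
f_best   = LIMIT f_sorted 1;
DUMP f_best;
```

The male side follows the same pattern with gender == 'M'. Movies with very few ratings can dominate an average-based ranking, so a minimum rating count per movie may be worth adding with COUNT(f_rows) and a FILTER.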