How do I access data fields in a Pig Latin bag?

Asked: 2015-12-06 15:30:50

Tags: join group-by apache-pig datafield

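I am using the IMDB database to find the highest-rated actors/actresses who also appear in the most movies in a given year. I am trying to join the actors data set with its ratings, then filter by year and sort the data by highest rating and number of movies.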

joinedActorRating = JOIN ratings by movie, actors BY movie;
actorRating = FOREACH joinedActorRating GENERATE *;
actorsYear = FILTER actorRating BY(year MATCHES '2000');
groupedYear = GROUP actorsYear BY (year,rating,firstName,lastName);
aggregatedYear = FOREACH groupedYear GENERATE group, COUNT (actorsYear) AS movieCount;
unaggregatedYear = FOREACH aggregatedYear GENERATE FLATTEN(group) AS (year,rating,firstName,lastName);
sortRating = ORDER unaggregatedYear BY rating ASC, count ASC;
dump sortRating; 

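The compiler reports "Invalid field projection" on the second line, but I am not sure how to access the year field after joining the two data sets. Does anyone know how to fix this?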

1 Answer:

Answer 0: (score: 0)

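After the join, you need to project the fields you want into the current relation, using the `relation::field` form to say which input each field comes from.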

joinedActorRating = JOIN ratings by movie, actors BY movie;
actorRating = FOREACH joinedActorRating GENERATE ratings::movie as movie
    , ratings::rank as rank, ratings::year as year, actors::firstName as firstName
    , actors::lastName as lastName;

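I am not sure which column lives in which table (other than movie), because you did not include the two table schemas, so I guessed. Adjust the projection as needed.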
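To find out which prefix each field actually carries, DESCRIBE can be run on the joined relation before writing the projection. The following is only a sketch: the LOAD statements, file names, and field types are hypothetical, since the question does not show them, and it assumes the rating column is named rating (as in the question's GROUP) and lives in ratings together with year. Averaging the rating per actor is also an assumption about what "highest rated" should mean, not something the question states.

```pig
-- Hypothetical inputs: file names, field names and types are assumptions,
-- because the original question does not include the LOAD statements.
ratings = LOAD 'ratings.tsv' AS (movie:chararray, year:chararray, rating:double);
actors  = LOAD 'actors.tsv'  AS (movie:chararray, firstName:chararray, lastName:chararray);

joinedActorRating = JOIN ratings BY movie, actors BY movie;

-- Print the joined schema; fields are listed with their relation prefix,
-- for example ratings::movie and actors::movie.
DESCRIBE joinedActorRating;

-- Project the prefixed fields to plain names for the later steps.
actorRating = FOREACH joinedActorRating GENERATE
    ratings::year     AS year,
    ratings::rating   AS rating,
    actors::firstName AS firstName,
    actors::lastName  AS lastName;

-- Keep a single year (year is assumed to be stored as a chararray).
actorsYear = FILTER actorRating BY year == '2000';

-- One group per actor; count the movies and average the ratings.
groupedYear = GROUP actorsYear BY (firstName, lastName);
aggregatedYear = FOREACH groupedYear GENERATE
    FLATTEN(group)         AS (firstName, lastName),
    COUNT(actorsYear)      AS movieCount,
    AVG(actorsYear.rating) AS avgRating;

-- Highest average rating first, then most movies.
sortRating = ORDER aggregatedYear BY avgRating DESC, movieCount DESC;
DUMP sortRating;
```

Grouping by actor only, rather than by (year, rating, firstName, lastName) as in the question, keeps movieCount as the number of movies per actor instead of the number of rows that share one exact rating. The ORDER step refers to movieCount, the alias assigned in the FOREACH, because no field named count exists in that relation.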