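I'm using the IMDB database to find the highest-rated actor/actress who also appeared in the most movies in a given year. I'm trying to join the actors dataset with its ratings, then filter by year and sort the data by highest rating and movie count.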
joinedActorRating = JOIN ratings by movie, actors BY movie;
actorRating = FOREACH joinedActorRating GENERATE *;
actorsYear = FILTER actorRating BY(year MATCHES '2000');
groupedYear = GROUP actorsYear BY (year,rating,firstName,lastName);
aggregatedYear = FOREACH groupedYear GENERATE group, COUNT (actorsYear) AS movieCount;
unaggregatedYear = FOREACH aggregatedYear GENERATE FLATTEN(group) AS (year,rating,firstName,lastName);
sortRating = ORDER unaggregatedYear BY rating ASC, count ASC;
dump sortRating;
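The compiler reports "invalid field projection" on the second line, but I'm not sure how to access the year field after joining the two datasets. Does anyone know how to fix this?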
Answer 0 (score: 0)
After the join, you need to project the fields you want into the current relation.
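joinedActorRating = JOIN ratings BY movie, actors BY movie;
-- Use relation::field to say which input each field comes from,
-- and rename it with AS so later statements can use the short name.
actorRating = FOREACH joinedActorRating GENERATE
    ratings::movie AS movie, ratings::rating AS rating, ratings::year AS year,
    actors::firstName AS firstName, actors::lastName AS lastName;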
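I'm not sure which column belongs to which table (other than movie), since you didn't include the two schemas, so I guessed. Adjust the projection as needed.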
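For completeness, here is a rough sketch of how the rest of the pipeline could look once the projection is in place. Everything about the inputs is an assumption on my part: the file names `ratings.tsv` and `actors.tsv`, the tab delimiter, and the schemas are hypothetical, and I am assuming the rating column is called `rating` (as in your GROUP BY) and that `year` is an int. I am also guessing that you want one row per actor with an average rating and a movie count, which is why this groups by name only instead of by (year, rating, firstName, lastName).

```pig
-- Hypothetical inputs: file names, delimiter and schemas are assumptions.
ratings = LOAD 'ratings.tsv' USING PigStorage('\t')
          AS (movie:chararray, rating:double, year:int);
actors  = LOAD 'actors.tsv' USING PigStorage('\t')
          AS (movie:chararray, firstName:chararray, lastName:chararray);

joinedActorRating = JOIN ratings BY movie, actors BY movie;

-- Project the joined fields under short names.
actorRating = FOREACH joinedActorRating GENERATE
    ratings::movie    AS movie,
    ratings::rating   AS rating,
    ratings::year     AS year,
    actors::firstName AS firstName,
    actors::lastName  AS lastName;

-- Keep a single year (if year is a chararray, compare with '2000' instead).
actorsYear = FILTER actorRating BY year == 2000;

-- One group per actor, so COUNT gives that actor's number of movies in the year.
groupedActor = GROUP actorsYear BY (firstName, lastName);

actorStats = FOREACH groupedActor GENERATE
    FLATTEN(group)         AS (firstName, lastName),
    AVG(actorsYear.rating) AS avgRating,
    COUNT(actorsYear)      AS movieCount;

-- Highest average rating first, ties broken by movie count.
sortedActors = ORDER actorStats BY avgRating DESC, movieCount DESC;
DUMP sortedActors;
```

Two things differ from your script, so check that they match what you want. Your last FOREACH dropped movieCount, so the ORDER had no count field to sort on; here the count is kept as movieCount. Your ORDER also used ASC, which puts the lowest values first; DESC puts the top-rated actors at the head of the output. If you only want the single best row, a `LIMIT sortedActors 1;` before the DUMP should do it.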