Apache Pig转换顺序

时间:2018-06-27 10:12:09

标签: hadoop apache-pig

我正在阅读Alan Gates的Pig编程。

考虑代码:

ratings = LOAD '/user/maria_dev/ml-100k/u.data' AS 
    (userID:int, movieID:int, rating:int, ratingTime:int);

metadata = LOAD '/user/maria_dev/ml-100k/u.item' USING PigStorage ('|') AS 
    (movieID:int, movieTitle:chararray, releaseDate:chararray, imdbLink: chararray);

nameLookup = FOREACH metadata GENERATE 
    movieID, movieTitle, ToDate(releaseDate, 'dd-MMM-yyyy') AS releaseYear;

nameLookupYear = FOREACH nameLookup GENERATE 
    movieID, movieTitle, GetYear(releaseYear) AS finalYear;

filterMovies = FILTER nameLookupYear BY finalYear < 1982;

groupedMovies = GROUP filterMovies BY finalYear;

orderedMovies = FOREACH groupedMovies {
    sortOrder = ORDER metadata by finalYear DESC;
    GENERATE GROUP, finalYear;
    };

DUMP orderedMovies;

它表明

  

“按地图,元组或袋子排序会产生错误”。

我想知道如何对分组结果进行排序。

转换是否需要遵循一定的顺序才能生效?

2 个答案:

答案 0 :(得分:1)

如果要对分组的值进行排序,则必须使用嵌套的foreach。这将在组中以降序对年份进行排序。

orderedMovies = FOREACH groupedMovies {
      sortOrder = ORDER metadata by GetYear(ToDate(releaseDate, 'dd-MMM-yyyy')) DESC;
      GENERATE GROUP, movieID, movieTitle;
};

答案 1 :(得分:1)

由于您要对分组结果进行排序,因此不需要嵌套的foreach。例如,如果要对年度内的每部电影按标题或发行日期进行排序,则可以使用嵌套的foreach。尝试照常订购(由于您在上一行中按finalYear进行了分组,因此将group称为finalYear

orderedMovies = ORDER groupedMovies BY group ASC;

DUMP orderedMovies;