猪 - 无法使用DUMP数据

时间:2017-03-25 07:15:33

标签: hadoop apache-pig

我有两个数据集,一个用于电影,另一个用于评级

电影数据看起来像

public static class Range
{
    // getters omitted for conciseness
    int low;
    int high;
    String major;
    public Range(int low, int high, String major)
    {
        this.low  = low;
        this.high  = high;
        this.major = major;
    }
    public boolean contains(int v)
    {
        return (v >= low && v <= high);
    }
}

public static Range[] ranges = {
        new Range(10004,10037,"AC"),
        new Range(10087,10108,"AC"),
        // etc
        // Ideally this table is populated from a data file that can
        // be updated at runtime without recompiling the code.
};

public String getMajor(String m)
{
    int crnCompare = Integer.parseInt(m);
    // Search for the matching range
    for (Range r : ranges)
        if (r.contains(crnCompare)) return r.major;
    return null;
}

评级数据看起来像

MovieID#Title#Genre
1#Toy Story (1995)#Animation|Children's|Comedy
2#Jumanji (1995)#Adventure|Children's|Fantasy
3#Grumpier Old Men (1995)#Comedy|Romance

我的脚本如下

UserID#MovieID#Ratings#RatingsTimestamp
1#1193#5#978300760
1#661#3#978302109
1#914#3#978301968

我收到此错误

    1) movies_data = LOAD '/user/admin/MoviesDataset/movies_new.dat' USING PigStorage('#') AS (movieid:int,
    moviename:chararray,moviegenere:chararray);

    2) ratings_data = LOAD '/user/admin/RatingsDataset/ratings_new.dat' USING PigStorage('#') AS (Userid:int,
    movieid:int,ratings:int,timestamp:long);

    3) moviedata_ratingsdata_join = JOIN movies_data BY movieid, ratings_data BY movieid;

    4) moviedata_ratingsdata_join_group = GROUP moviedata_ratingsdata_join BY movies_data.movieid;

    5) moviedata_ratingsdata_averagerating = FOREACH moviedata_ratingsdata_join_group GENERATE group,
    AVG(moviedata_ratingsdata_join.ratings) AS Averageratings, (moviedata_ratingsdata_join.Userid) AS userid;

    6) DUMP moviedata_ratingsdata_averagerating;

如果删除第6行,脚本会成功执行

为什么我不能在第5行产生DUMP关系?

1 个答案:

答案 0 :(得分:2)

使用disambiguate operator::)在JOINCOGROUPCROSSFLATTEN运营商之后识别字段名称。

关系movies_dataratings_data都有一列movieid。在形成关系moviedata_ratingsdata_join_group时,使用::运算符来标识要用于movieid的列GROUP

所以 4) 会是这样的,

4) moviedata_ratingsdata_join_group = GROUP moviedata_ratingsdata_join BY movies_data::movieid;