I have two datasets, one for movies and one for ratings.
The movie data looks like this:
MovieID#Title#Genre
1#Toy Story (1995)#Animation|Children's|Comedy
2#Jumanji (1995)#Adventure|Children's|Fantasy
3#Grumpier Old Men (1995)#Comedy|Romance
The ratings data looks like this:
UserID#MovieID#Ratings#RatingsTimestamp
1#1193#5#978300760
1#661#3#978302109
1#914#3#978301968
My script is as follows:
1) movies_data = LOAD '/user/admin/MoviesDataset/movies_new.dat' USING PigStorage('#') AS (movieid:int,
moviename:chararray,moviegenere:chararray);
2) ratings_data = LOAD '/user/admin/RatingsDataset/ratings_new.dat' USING PigStorage('#') AS (Userid:int,
movieid:int,ratings:int,timestamp:long);
3) moviedata_ratingsdata_join = JOIN movies_data BY movieid, ratings_data BY movieid;
4) moviedata_ratingsdata_join_group = GROUP moviedata_ratingsdata_join BY movies_data.movieid;
5) moviedata_ratingsdata_averagerating = FOREACH moviedata_ratingsdata_join_group GENERATE group,
AVG(moviedata_ratingsdata_join.ratings) AS Averageratings, (moviedata_ratingsdata_join.Userid) AS userid;
6) DUMP moviedata_ratingsdata_averagerating;
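With line 6 included, the script fails with an error. If I remove line 6, the script runs successfully.
Why can't I DUMP the relation produced on line 5?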
Answer 0 (score: 2)
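Use the disambiguate operator (::) to identify field names after a JOIN, COGROUP, CROSS, or FLATTEN operator.
The relations movies_data and ratings_data both have a column named movieid. When forming the relation moviedata_ratingsdata_join_group, use the :: operator to say which movieid column the GROUP should use.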
So line 4) would look like this:
4) moviedata_ratingsdata_join_group = GROUP moviedata_ratingsdata_join BY movies_data::movieid;
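For context, here is a minimal sketch of the whole script with that change applied. The paths and schemas are taken from the question; the shorter aliases joined, grouped, and avg_ratings are illustrative names that do not appear in the original script, and the userid column is left out for brevity. The DESCRIBE statement is added only so you can inspect the prefixed field names that the JOIN produces.

```pig
-- Load both datasets with '#' as the field delimiter (paths as given in the question).
movies_data = LOAD '/user/admin/MoviesDataset/movies_new.dat' USING PigStorage('#')
    AS (movieid:int, moviename:chararray, moviegenere:chararray);
ratings_data = LOAD '/user/admin/RatingsDataset/ratings_new.dat' USING PigStorage('#')
    AS (Userid:int, movieid:int, ratings:int, timestamp:long);

-- 'joined', 'grouped' and 'avg_ratings' are illustrative aliases, not from the question.
joined = JOIN movies_data BY movieid, ratings_data BY movieid;

-- Print the schema of the joined relation to see how its fields are named.
DESCRIBE joined;

-- movieid exists in both inputs, so qualify it with the source relation and '::'.
grouped = GROUP joined BY movies_data::movieid;

-- 'ratings' exists in only one input, so it can be referenced without a prefix.
avg_ratings = FOREACH grouped GENERATE
    group AS movieid,
    AVG(joined.ratings) AS Averageratings;

DUMP avg_ratings;
```

Writing movies_data.movieid (with a dot) refers to the separate relation movies_data rather than to a field of the joined relation, which is why the :: form is needed in the GROUP statement.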