Suppose I have a dataset of restaurant reviews:
User,City,Restaurant,Rating
Jim,New York,Mecurials,3
Jim,New York,Whapme,4.5
Jim,London,Pint Size,2
Lisa,London,Pint Size,4
Lisa,London,Rabbit Whole,3.5
I want to generate a list of the average rating per user and city, i.e. the output:
User,City,AverageRating
Jim,New York,3.75
Jim,London,2
Lisa,London,3.75
I can write a Pig script like this:
Data = LOAD 'data.txt' USING PigStorage(',') AS (
user:chararray, city:chararray, restaurant:chararray, rating:float
);
PerUserCity = GROUP Data BY (user, city);
ResultSet = FOREACH PerUserCity {
GENERATE group.user, group.city, AVG(Data.rating);
}
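But I'm curious whether I can first group by the higher-level key (user) and then group by the lower-level key (city) within each user, i.e.: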
PerUser = GROUP Data BY user;
Intermediate = FOREACH PerUser {
B = GROUP Data BY city;
GENERATE group AS user, B;
}
I get:
Error during parsing.
Invalid alias: GROUP in {
group: chararray,
Data: {
user: chararray,
city: chararray,
restaurant: chararray,
rating: float
}
}
Has anyone done this successfully? Is it simply not possible to GROUP inside a FOREACH?
My goal is to do something like:
ResultSet = FOREACH PerUser {
FOREACH City {
GENERATE user, city, AVG(City.rating)
}
}
Answer 0 (score: 8)
Currently, the operations allowed inside a FOREACH are DISTINCT, FILTER, LIMIT, and ORDER BY.
For now, grouping directly by (user, city), as you said, is the way to go.
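To illustrate what a nested block does accept, here is a minimal sketch that uses FILTER inside a FOREACH. It assumes the same data.txt as in the question, including its header row; the aliases Rows, London, and Summary, and the field name london_avg, are made up for this example.

```pig
Data = LOAD 'data.txt' USING PigStorage(',') AS (
    user:chararray, city:chararray, restaurant:chararray, rating:float
);
-- Assumption: data.txt starts with the header line "User,City,Restaurant,Rating".
Rows = FILTER Data BY user != 'User';
PerUser = GROUP Rows BY user;
Summary = FOREACH PerUser {
    -- FILTER is one of the operators permitted inside a nested block.
    London = FILTER Rows BY city == 'London';
    GENERATE group AS user, AVG(London.rating) AS london_avg;
};
DUMP Summary;
```

This narrows each user's bag to a single city before aggregating, but it does not split the bag by every city at once. That still requires grouping by (user, city) at the top level.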
Answer 1 (score: 2)
The release notes for Pig 0.10 indicate that nested FOREACH operations are now supported.
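As a rough sketch of what that looks like, assuming Pig 0.10 or later and the question's data.txt (the aliases Rows, Cities, UniqueCities, and CityCounts are hypothetical), a nested FOREACH can project a column out of each user's bag before further processing:

```pig
Data = LOAD 'data.txt' USING PigStorage(',') AS (
    user:chararray, city:chararray, restaurant:chararray, rating:float
);
-- Assumption: data.txt starts with a header line, which is dropped here.
Rows = FILTER Data BY user != 'User';
PerUser = GROUP Rows BY user;
CityCounts = FOREACH PerUser {
    -- Nested FOREACH: keep only the city column of this user's bag.
    Cities = FOREACH Rows GENERATE city;
    UniqueCities = DISTINCT Cities;
    GENERATE group AS user, COUNT(UniqueCities) AS city_count;
};
DUMP CityCounts;
```

As far as I know, this does not make GROUP available inside the block, so the per-city average from the question still calls for grouping by (user, city) directly.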
Answer 2 (score: 1)
Try this:
Records = load 'data_rating.txt' using PigStorage(',') as (user:chararray, city:chararray, restaurant:chararray, rating:float);
grpRecs = group Records By (user,city);
avgRating_Byuser_perCity = foreach grpRecs generate group, AVG(Records.rating) as average;
Result = foreach avgRating_Byuser_perCity generate flatten(group), average;
Answer 3 (score: 0)
rawdata = load 'data' using PigStorage(',') as (user:chararray , city:chararray , restaurant:chararray , rating:float);
data = filter rawdata by user != 'User';
groupbyusercity = group data by (user,city);
--describe groupbyusercity;
--groupbyusercity: {group: (user: chararray,city: chararray),data: {(user: chararray,city: chararray,restaurant: chararray,rating: float)}}
average = foreach groupbyusercity {
generate group.user,group.city,AVG(data.rating);
}
dump average;
Answer 4 (score: 0)
Grouping by both keys and then flattening the structure produces the same result:
Load the data as you did:
Data = LOAD 'data.txt' USING PigStorage(',') AS (
user:chararray, city:chararray, restaurant:chararray, rating:float);
Group by user and city:
ByUserByCity = GROUP Data BY (user, city);
Add the average rating for each group (you can add more, e.g. COUNT(Data) AS count_res), then flatten the group structure back into the original fields:
ByUserByCityAvg = FOREACH ByUserByCity GENERATE
FLATTEN(group) AS (user, city),
AVG(Data.rating) as user_city_avg;
Result:
Jim,London,2.0
Jim,New York,3.75
Lisa,London,3.75
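User,City,
(This last row comes from the header line of the input file; its average is empty because "Rating" cannot be cast to float.)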