I think this question is similar to this one:
Selecting fields after grouping in Pig
But my question involves sample data made up of the following:
user_name,movie_name,company,rating
Jim,Jaws,A,4
Jim,Baseball,B,4
Matt,Halo,A,5
Matt,Baseball,B,4
Matt,History of Chairs,B,3.5
Pat,History of Chairs,B,3
John,History of Chairs,B,2
Frank,Battle Tanks,A,3
Frank,History of Chairs,B,5
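How can I group together all the movies a user has watched without losing the other information, such as company and rating?
For each user, I want to cross the movies from company A with the movies from company B and add up the ratings the user gave them.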
Jim,Jaws,Baseball,8
Matt,Halo,Baseball,9
Frank,Battle Tanks,History of Chairs,8
would be the output, in the format:
user,companyA,companyB,rating
I started with the load, followed by:
r1 = LOAD 'data.csv' USING PigStorage(',') as (user_name:chararray, movie_name:chararray, company_name:chararray, rating:int);
r2 = group r1 by user_name;
r3 = foreach r2 generate group as user_name, flatten(r1);
r4A = filter r3 by company_name == 'A';
r4B = filter r3 by company_name == 'B';
Then I have something like:
(Frank,Frank,Battle Tanks,A,3)
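Next I plan to do a cross of r4A and r4B and sum the ratings. But I'm not sure whether the duplicated user_name will make this inefficient.
Is this the right approach? Any ideas for making it better? Any help would be appreciated!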
Answer 0 (score: 0)
Can you try this?
Input:
Jim,Jaws,A,4
Jim,Baseball,B,4
Matt,Halo,A,5
Matt,Baseball,B,4
Matt,History of Chairs,B,3.5
Pat,History of Chairs,B,3
John,History of Chairs,B,2
Frank,Battle Tanks,A,3
Frank,History of Chairs,B,5
PigScript:
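-- Load the ratings; rating is a float because the data contains 3.5.
A = LOAD 'input' USING PigStorage(',') AS (user_name:chararray, movie_name:chararray, company:chararray, rating:float);
-- One group (bag of rows) per user.
B = GROUP A BY user_name;
C = FOREACH B {
    -- Split each user's rows by company and sum the ratings per company.
    filterCompanyA = FILTER A BY company=='A';
    sumA = SUM(filterCompanyA.rating);
    filterCompanyB = FILTER A BY company=='B';
    sumB = SUM(filterCompanyB.rating);
    -- BagToString joins the movie names with '_' by default; REPLACE turns that into ','.
    GENERATE group AS user,
        FLATTEN(REPLACE(BagToString(filterCompanyA.movie_name),'_',',')) AS companyA,
        FLATTEN(REPLACE(BagToString(filterCompanyB.movie_name),'_',',')) AS companyB,
        -- Treat a missing company as 0 so the total is never null.
        (((sumA is null)?0:sumA)+((sumB is null)?0:sumB)) AS Rating;
}
D = FOREACH C GENERATE user,companyA,companyB,Rating;
DUMP D;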
Output:
(Jim,Jaws,Baseball,8.0)
(Pat,,History of Chairs,3.0)
(John,,History of Chairs,2.0)
(Matt,Halo,Baseball,History of Chairs,12.5)
(Frank,Battle Tanks,History of Chairs,8.0)
In the output above, Pat and John have not watched any movies from company A, so that column is empty (null) for them.
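If you only want users who have movies from both companies, which is what the expected output in the question seems to show, an inner join of the two filtered relations may be closer to the cross you described. This is only a rough sketch, assuming the same 'input' file and schema as above; the aliases onlyA, onlyB, paired and summed are names made up for illustration.

```pig
-- Sketch only: onlyA, onlyB, paired and summed are illustrative aliases.
A = LOAD 'input' USING PigStorage(',') AS (user_name:chararray, movie_name:chararray, company:chararray, rating:float);

-- Split the rows by company.
onlyA = FILTER A BY company == 'A';
onlyB = FILTER A BY company == 'B';

-- Inner join on user: users missing either company are dropped.
paired = JOIN onlyA BY user_name, onlyB BY user_name;

-- One output row per (company A movie, company B movie) pair.
summed = FOREACH paired GENERATE
    onlyA::user_name AS user,
    onlyA::movie_name AS companyA,
    onlyB::movie_name AS companyB,
    onlyA::rating + onlyB::rating AS rating;

DUMP summed;
```

Because this is a join rather than a per-user aggregate, a user with several movies from one company gets one row per pairing (Matt would get a row for Halo with Baseball and another for Halo with History of Chairs), and Pat and John would not appear at all. It also skips the GROUP and FLATTEN step from the question, so there is no duplicated user_name column to carry around.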