My two tables:
Posts.csv -
id
post_type
creationdate
score
viewcount
owneruserid
title
answercount
commentcount
Users.csv -
id
reputation
displayname
loc
age
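I know part of the logic: I need to group by owneruserid and then count the id values in Posts.csv.
Then I need to link that to Users.csv, that is, join owneruserid from Posts.csv with id from Users.csv.
Any help would be appreciated.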
Answer 0: (score: 0)
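You have already laid out the logic, so build on those steps. See the script below. Load the data, join Posts.owneruserid to Users.id, then group by owneruserid. For each group, generate the count of posts. Finally, sort the result in descending order of that count and keep only the top row.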
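-- Load both files; Pig schemas are written as name:type
A = LOAD 'Posts.csv' USING PigStorage(',') AS (id:int,post_type:chararray,creationdate:chararray,score:int,viewcount:int,owneruserid:int,title:chararray,answercount:int,commentcount:int);
B = LOAD 'Users.csv' USING PigStorage(',') AS (id:int,reputation:int,displayname:chararray,loc:chararray,age:int);
-- Inner join; after the join, fields are addressed as A::field and B::field
C = JOIN A BY owneruserid, B BY id;
-- Group on both keys so displayname can be projected next to the count
D = GROUP C BY (A::owneruserid, B::displayname);
E = FOREACH D GENERATE FLATTEN(group) AS (userid, displayname), COUNT(C) AS TotalPosts;
F = ORDER E BY TotalPosts DESC;
-- LIMIT needs the input relation as well as the row count
G = LIMIT F 1;
DUMP G;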
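To sanity-check what the script should return, the same join, group, count, order, and limit steps can be mirrored in plain Python on a few rows. This is only a rough sketch: the sample rows, the user names, and the helper `top_poster` are made up for illustration and are not part of the original data or of Pig.

```python
from collections import Counter

# Hypothetical sample rows; only the columns the query needs are kept.
posts = [  # (id, owneruserid)
    (1, 10), (2, 11), (3, 10), (4, 12), (5, 10), (6, 11),
]
users = {10: "alice", 11: "bob", 12: "carol"}  # id -> displayname


def top_poster(posts, users):
    """Return (userid, displayname, total_posts) for the most active user."""
    # GROUP BY owneruserid + COUNT, keeping only owners that exist in
    # users (the effect of the inner JOIN).
    counts = Counter(owner for _, owner in posts if owner in users)
    if not counts:
        return None
    # ORDER BY TotalPosts DESC + LIMIT 1
    userid, total = counts.most_common(1)[0]
    return userid, users[userid], total


print(top_poster(posts, users))  # (10, 'alice', 3)
```

If two users tie for the highest count, this sketch returns whichever was seen first, so the winner of a tie should not be relied on.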