I'm trying to count the number of occurrences of each member_id. The data looks like this (member_id, item_type):
2020292 Abc
2020292 Acd
2020292 Abc
2938201 CDE
The output should then look like this (id, count):
2020292 3
2938201 1
I tried the following:
data=FOREACH data GENERATE member_id, item_type;
grouping=group data by member_id;
count_elements=foreach grouping generate flatten(group) as member_id, COUNT(data) as num_elements;
I also tried variations of count_elements, such as 'foreach grouping generate member_id, COUNT(data) as num_elements;' and 'foreach grouping generate flatten(group) as member_id, COUNT(data.item_type) as num_elements;', and none of them work. Any help is greatly appreciated. Thanks.
Answer 0 (score: 0)
Input:
2020292,Abc
2020292,Acd
2020292,Abc
2938201,CDE
Code:
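-- Load the comma-separated input as (id, item_typ).
read = load 'test.data' using PigStorage(',') as (id:int,item_typ:chararray);
-- Group the rows by id; each group holds a bag named 'read'.
grouped_Data = group read by id;
describe grouped_Data;
-- COUNT returns a long, so rec_cnt is declared as long.
count_val = foreach grouped_Data GENERATE group as (member_id:int),COUNT(read) as (rec_cnt:long);
dump count_val;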
Output:
(2020292,3)
(2938201,1)
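To sanity-check the expected result outside Pig, the same group-and-count logic can be sketched in plain Python. This is only an illustration, assuming the four sample rows above are held in memory; the names `rows` and `count_by_member` are hypothetical and are not part of the Pig script.

```python
from collections import Counter

# Hypothetical in-memory copy of the sample input: (member_id, item_type).
rows = [
    (2020292, "Abc"),
    (2020292, "Acd"),
    (2020292, "Abc"),
    (2938201, "CDE"),
]


def count_by_member(records):
    """Count rows per member_id, mirroring GROUP ... BY id followed by COUNT."""
    return Counter(member_id for member_id, _item_type in records)


counts = count_by_member(rows)
for member_id, rec_cnt in sorted(counts.items()):
    print((member_id, rec_cnt))
```

The counts it prints should agree with the Pig output above; if they do not, the input file or its delimiter is the first thing to check.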
Answer 1 (score: 0)
Jenny, I've added code that covers your question as well as the follow-up you raised in the comments above (on @Learner's answer).
Input data:
2020292,Abc
2020292,Acd
2020292,Abc
2938201,CDE
Sample data for id_list:
2020292
2020291
2020290
Pig script:
-- Load the (member_id, item_type) rows and the list of ids of interest.
data = LOAD '/pigsamples/groupdata' USING PigStorage(',')
    AS (member_id:INT, item_type:CHARARRAY);
id_list_data = LOAD '/pigsamples/groupidlist' USING PigStorage(',') AS (member_id:INT);
-- Count the rows for each member_id.
group_data = GROUP data BY member_id;
count_grouped_data = FOREACH group_data GENERATE group AS member_id, COUNT(data) AS count;
-- Inner join: keep only the ids that also appear in id_list.
join_data = JOIN count_grouped_data BY member_id, id_list_data BY member_id;
group_joined_data = FOREACH join_data GENERATE count_grouped_data::member_id
    AS id, count_grouped_data::count AS count_item_type;
DUMP group_joined_data;
Output:
(2020292,3)
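Only 2020292 survives because JOIN in Pig is an inner join by default: 2938201 is missing from id_list, and 2020291 and 2020290 have no rows in the input data. As a rough cross-check, the same count-then-filter logic can be sketched in plain Python. The names `data_rows`, `id_list`, and `count_for_listed_ids` are hypothetical stand-ins for the two sample inputs, not part of the Pig script.

```python
from collections import Counter

# Hypothetical in-memory copies of the two sample inputs.
data_rows = [
    (2020292, "Abc"),
    (2020292, "Acd"),
    (2020292, "Abc"),
    (2938201, "CDE"),
]
id_list = [2020292, 2020291, 2020290]


def count_for_listed_ids(records, wanted_ids):
    """Count rows per member_id, then keep only ids present in both inputs."""
    counts = Counter(member_id for member_id, _item_type in records)
    # Inner-join semantics: an id with no rows in `records` is dropped.
    return [(mid, counts[mid]) for mid in wanted_ids if mid in counts]


result = count_for_listed_ids(data_rows, id_list)
print(result)
```

If ids without any matching rows should still appear (with a count of zero), an outer join would be needed instead; that is a different design choice from the script above.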