Grouping in Pig

Asked: 2015-08-10 20:39:06

Tags: apache-pig

I am trying to count the number of occurrences of each member_id. The data looks like this (member_id, item_type):

2020292 Abc

2020292 Acd

2020292 Abc

2938201 CDE

The output should look like this (id, count):

2020292 3

2938201 1

I tried the following:

data=FOREACH data GENERATE member_id, item_type;
grouping=group data by member_id;
count_elements=foreach grouping generate flatten(group) as member_id, COUNT(data) as num_elements;

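I also tried variants of the count_elements statement, such as 'foreach grouping generate member_id, COUNT(data) as num_elements;' and 'foreach grouping generate flatten(group) as member_id, COUNT(data.item_type) as num_elements;', and none of them work. Any help is greatly appreciated. Thanks.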

2 Answers:

Answer 0 (score: 0)

Input:

2020292,Abc
2020292,Acd
2020292,Abc
2938201,CDE

Code:

read = load 'test.data' using PigStorage(',') as (id:int,item_typ:chararray);
grouped_Data = group read by id;
describe grouped_Data;
count_val = foreach grouped_Data GENERATE group as (member_id:int),COUNT(read) as (rec_cnt:int);
dump count_val;

Output:

(2020292,3)
(2938201,1)
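
To see why referring to member_id directly after the GROUP fails, it can help to inspect the schema of the grouped relation. The sketch below reuses the file name and field names from the code above and assumes a comma-delimited input. After GROUP, the key is exposed as a field named group, and the original rows sit in a bag named after the grouped relation (here, read), so the original column name is no longer a top-level field.

```pig
-- Load the sample input (file name and schema taken from the answer above).
read = LOAD 'test.data' USING PigStorage(',') AS (id:int, item_typ:chararray);

-- One output row per distinct id.
grouped_Data = GROUP read BY id;

-- Prints the schema: a key field called `group` plus a bag called `read`.
DESCRIBE grouped_Data;

-- Rename the key and count the tuples in each bag. COUNT returns a long.
count_val = FOREACH grouped_Data GENERATE group AS member_id, COUNT(read) AS rec_cnt;

-- Prints the schema of the result so the field names and types can be checked.
DESCRIBE count_val;
```

DESCRIBE only prints schemas and does not launch a job, so it is a cheap way to check field names before running DUMP.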

Answer 1 (score: 0)

Jenny, I have added code for your question as well as for the follow-up you raised in the comments above (on @Learner's answer).

Input data:

2020292,Abc
2020292,Acd
2020292,Abc
2938201,CDE

Sample data for id_list:

2020292
2020291
2020290

Pig script:

data = LOAD '/pigsamples/groupdata' USING PigStorage(',') 
       AS (member_id:INT, item_type:CHARARRAY);
id_list_data = LOAD '/pigsamples/groupidlist' USING PigStorage(',') AS (member_id:INT);

group_data = GROUP data BY member_id;
count_grouped_data = FOREACH group_data GENERATE group AS member_id, COUNT(data) AS count;

join_data = JOIN count_grouped_data BY member_id, id_list_data BY member_id;

group_joined_data = FOREACH join_data GENERATE count_grouped_data::member_id 
                    AS id, count_grouped_data::count AS count_item_type;

Output:

(2020292,3)
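
Only 2020292 appears because JOIN is an inner join by default, so ids from id_list with no matching rows in the data are dropped. If those ids should be reported too, one possible variation (an assumption about the requirement, not part of the original question) is a left outer join that substitutes 0 for the missing count. The relation names left_joined and all_ids and the field name num_items are made up for this sketch; the paths are the ones used in the script above.

```pig
-- Same inputs as the script above.
data = LOAD '/pigsamples/groupdata' USING PigStorage(',')
       AS (member_id:INT, item_type:CHARARRAY);
id_list_data = LOAD '/pigsamples/groupidlist' USING PigStorage(',') AS (member_id:INT);

-- Count rows per member_id.
group_data = GROUP data BY member_id;
counts = FOREACH group_data GENERATE group AS member_id, COUNT(data) AS num_items;

-- Keep every id from id_list_data, even when it has no matching count.
left_joined = JOIN id_list_data BY member_id LEFT OUTER, counts BY member_id;

-- Unmatched ids have a null count after the outer join; replace it with 0.
all_ids = FOREACH left_joined GENERATE
              id_list_data::member_id AS id,
              (counts::num_items IS NULL ? 0L : counts::num_items) AS count_item_type;

DUMP all_ids;
```

With the sample data above, 2020292 would still be reported with a count of 3, while 2020291 and 2020290 would be reported with 0.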