Counting grouped records in a Pig query

Time: 2015-06-15 12:15:31

Tags: apache-pig

Here is my test data.

John,q1,Correct
Jack,q1,wrong
John,q2,Correct
Jack,q2,wrong
John,q3,wrong
Jack,q3,Correct
John,q4,wrong
Jack,q4,wrong
John,q5,wrong
Jack,q5,wrong

I want to get something like the following:

John wrong  4
John correct 1
Jack wrong  3
Jack correct 2

My code:

data = LOAD '/stackoverflowq4.txt' USING PigStorage(',') AS (
       name:chararray, 
       number:chararray,
       result:chararray);
B = GROUP data by (name,result);

The output currently looks like this:

((John,wrong),{(John,q5,wrong),(John,q4,wrong),(John,q2,wrong),(John,q1,wrong)})
((John,Correct),{(John,q3,Correct)})
((Jack,wrong),{(Jack,q5,wrong),(Jack,q4,wrong),(Jack,q3,wrong)})
((Jack,Correct),{(Jack,q2,Correct),(Jack,q1,Correct)})

How should I count the number of records in each group?

1 answer:

Answer 0: (score: 3)

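The COUNT function gives you the number of elements in a bag, which is exactly what you want. After grouping by name and result, you end up with one bag per (name, result) combination, holding every record that has that combination, so counting the bag gives the number of occurrences.

So you only need to add one line: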

data = LOAD '/stackoverflowq4.txt' USING PigStorage(',') AS (
   name:chararray, 
   number:chararray,
   result:chararray);
B = GROUP data by (name,result);
C = foreach B generate FLATTEN(group) as (name,result), COUNT(data) as count;

dump C;
(Jack,wrong,4)
(Jack,Correct,1)
(John,wrong,3)
(John,Correct,2)

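FLATTEN(group) is there because grouping produces a tuple containing the fields you grouped by, and judging by the output you want, you don't want those fields nested inside a tuple. Without it, each output row would look like ((Jack,wrong),4).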
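As a cross-check outside Pig, here is a minimal Python sketch of what the GROUP, COUNT and FLATTEN steps compute on the ten sample rows from the question. It is only a model of the semantics, not how Pig executes the query, and the names `ROWS` and `count_by_name_and_result` are made up for this illustration.

```python
from collections import Counter

# The ten sample rows from the question: name, number, result.
ROWS = """\
John,q1,Correct
Jack,q1,wrong
John,q2,Correct
Jack,q2,wrong
John,q3,wrong
Jack,q3,Correct
John,q4,wrong
Jack,q4,wrong
John,q5,wrong
Jack,q5,wrong
""".splitlines()


def count_by_name_and_result(lines):
    """Model of GROUP data BY (name, result) followed by COUNT(data)."""
    counts = Counter()
    for line in lines:
        name, _number, result = line.split(",")
        counts[(name, result)] += 1  # one "bag" per (name, result) key
    return counts


counts = count_by_name_and_result(ROWS)

# Without FLATTEN: the group key stays a nested tuple, like ((Jack,wrong),4).
nested = [(key, n) for key, n in counts.items()]

# With FLATTEN(group): the key's fields become top-level columns, like (Jack,wrong,4).
flat = [(name, result, n) for (name, result), n in counts.items()]

for row in sorted(flat):
    print(row)
```

On this input the sketch gives 4 wrong and 1 Correct for Jack, and 3 wrong and 2 Correct for John, which matches the dump output shown in the answer.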