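PIG: How do I create a percentage (%) based table?

Time: 2016-08-07 02:42:52

Tags: apache-pig

I am trying to create a table that shows occurrence counts as percentages. For example, I have a table named example that contains the following data: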

class, value
------ -------
1     ,  abc
1     ,  abc
1     ,  xyz
1     ,  abc
2     ,  xyz
2     ,  abc

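Here, for class 1, 'abc' appears 3 times and 'xyz' appears only once, out of 4 occurrences in total. For class 2, 'abc' and 'xyz' each appear once (2 occurrences in total).

So the output should be: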

class, %_of_abc, %_of_xyz
------ --------  --------
1     ,  75     ,   25
2     ,  50     ,   50

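Any idea how to do this when the set of column values can vary? I was thinking of using GROUP, but I am not sure how grouping by class would help me.

1 Answer:

Answer 0: (score: 0)

It is a bit involved, but here is a solution: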

grunt> Dump A;
(1,abc)
(1,abc)
(1,xyz)
(1,abc)
(2,xyz)
(2,abc)
grunt> B = Group A by class;
grunt> C = foreach B generate group as class:int, COUNT(A) as cnt;
grunt> D = Group A by (class,value);          
grunt> E = foreach D generate FLATTEN(group), COUNT(A) as tot_cnt;
grunt> F = foreach E generate $0 as class:int, $1 as value:chararray, tot_cnt;
grunt> G = JOIN F BY class,C BY class;
grunt> H = foreach G generate $0 as class,$1 as value,($2*100/$4) as perc;
grunt> Dump H;
(1,xyz,25)
(1,abc,75)
(2,xyz,50)
(2,abc,50)
grunt> I = group H by class;
grunt> J = FOREACH I generate group as class, FLATTEN(BagToTuple(H.perc));
grunt> Dump J;
(1,75,25)
(2,50,50)
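
The session above can also be written as a standalone script. What follows is only a sketch under a few assumptions that are not in the original answer: the input lives in a hypothetical file named example.csv with one unpadded class,value pair per line, the percentage is computed in floating point (the integer expression $2*100/$4 truncates, so 1 out of 3 would give 33), and each class's bag is sorted by value before BagToTuple, because the order of tuples inside a grouped bag is not guaranteed.

```pig
-- Assumption: 'example.csv' is a hypothetical comma-separated input file,
-- one "class,value" pair per line, without padding spaces.
A = LOAD 'example.csv' USING PigStorage(',') AS (class:int, value:chararray);

-- Total number of rows per class.
B = GROUP A BY class;
C = FOREACH B GENERATE group AS class, COUNT(A) AS total;

-- Number of rows per (class, value) pair.
D = GROUP A BY (class, value);
E = FOREACH D GENERATE FLATTEN(group) AS (class, value), COUNT(A) AS cnt;

-- Join both counts on class and compute the percentage as a double.
F = JOIN E BY class, C BY class;
G = FOREACH F GENERATE E::class AS class, E::value AS value,
                       (100.0 * E::cnt / C::total) AS perc;

-- Sort each class's bag by value so that the percentage columns
-- are intended to come out in the same order for every class.
H = GROUP G BY class;
I = FOREACH H {
    sorted = ORDER G BY value;
    GENERATE group AS class, FLATTEN(BagToTuple(sorted.perc));
};

DUMP I;
```

Note that this only yields aligned columns when every class contains every value. If a value is missing for some class, that class's tuple is simply shorter, so a fixed set of columns (one per known value) would need a different approach.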