I am trying to create a table that shows occurrence counts as percentages. For example, I have a table named example that contains the following data:
class, value
------ -------
1 , abc
1 , abc
1 , xyz
1 , abc
2 , xyz
2 , abc
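Here, for class 1, 'abc' occurs 3 times and 'xyz' occurs only once, out of 4 total occurrences. For class 2, 'abc' and 'xyz' each occur once (2 occurrences in total).
So the output should be: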
class, %_of_abc, %_of_xyz
------ -------- --------
1 , 75 , 25
2 , 50 , 50
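Any idea how to do this when the set of values in the column can change? I was thinking of using GROUP, but I am not sure how grouping by class would help me.
Answer 0 (score: 0)
It is a bit involved, but here is a solution: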
grunt> Dump A;
(1,abc)
(1,abc)
(1,xyz)
(1,abc)
(2,xyz)
(2,abc)
grunt> B = Group A by class;
grunt> C = foreach B generate group as class:int, COUNT(A) as cnt;
grunt> D = Group A by (class,value);
grunt> E = foreach D generate FLATTEN(group), COUNT(A) as tot_cnt;
grunt> F = foreach E generate $0 as class:int, $1 as value:chararray, tot_cnt;
grunt> G = JOIN F BY class,C BY class;
grunt> H = foreach G generate $0 as class,$1 as value,($2*100/$4) as perc;
grunt> Dump H;
(1,xyz,25)
(1,abc,75)
(2,xyz,50)
(2,abc,50)
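grunt> I = GROUP H BY class;
-- Bags are unordered, so the order of the percentages inside each output tuple is not guaranteed.
grunt> J = FOREACH I generate group as class, FLATTEN(BagToTuple(H.perc));
grunt> Dump J;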
(1,75,25)
(2,50,50)
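To sanity-check the numbers outside Pig, the same logic (count per class, count per (class, value) pair, then integer percentage) can be sketched in plain Python. This is only an illustrative cross-check, not part of the Pig solution; the `percentages` function and the `rows` list are made-up names for this example. Unlike the bag-based step above, it sorts the values so the column order is fixed.

```python
from collections import Counter

# Sample data copied from the question: (class, value) pairs.
rows = [(1, "abc"), (1, "abc"), (1, "xyz"), (1, "abc"), (2, "xyz"), (2, "abc")]


def percentages(rows):
    """Return {class: {value: integer percentage of that value within the class}}."""
    per_class = Counter(cls for cls, _ in rows)   # like relation C: total rows per class
    per_pair = Counter(rows)                      # like relation E: rows per (class, value)
    values = sorted({value for _, value in rows}) # fixed column order: abc, xyz, ...
    result = {}
    for cls in sorted(per_class):
        # Integer division, matching ($2*100/$4) on the counts in the Pig script.
        result[cls] = {v: per_pair[(cls, v)] * 100 // per_class[cls] for v in values}
    return result


table = percentages(rows)
for cls, pcts in table.items():
    print(cls, pcts)
```

Because the values are collected from the data itself, a new value in the column simply becomes another key in each per-class dictionary.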