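Basic statistics with Apache Pig

Time: 2017-03-20 22:45:38

Tags: apache-pig

I am trying to use Apache Pig to compute the fraction of rows that have a given attribute.

For example, if the data looks like this: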

    a,15
    a,16
    a,17
    b,3
    b,16

I would like to get:

    a,0.6
    b,0.4

I am trying the following:

    A = LOAD 'my file' USING PigStorage(',');
    total = FOREACH (GROUP A ALL) GENERATE COUNT(A);

This gives me total = (5), but when I try to use this 'total':

    fractions = FOREACH (GROUP A by $0) GENERATE COUNT(A)/total;

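I get an error.

Apparently COUNT() returns some kind of projection, and the two projections (the one computing the total and the one computing the fractions) have to be consistent. Is there a way to make this work? Or perhaps a way to simply convert total to a number and avoid this projection-consistency requirement?

3 answers: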

Answer 0: (score: 1)

You have to project it and cast it to double:

    total = FOREACH (GROUP A ALL) GENERATE COUNT(A);
    rows = FOREACH (GROUP A by $0) GENERATE group,COUNT(A);
    fractions = FOREACH rows GENERATE rows.$0,(double)rows.$1/(double)total.$0;
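
The point about casting can be sketched as follows. COUNT returns a long, so dividing one count by another without a cast is integer division, and any fraction below 1 loses its fractional part. This is only a sketch under stated assumptions: the aliases counts, truncated and n are illustrative names that do not appear in the answer above, and fields are referenced by position inside each FOREACH.

    A = LOAD 'my file' USING PigStorage(',');
    -- one row holding the overall row count (n is an illustrative field name)
    total = FOREACH (GROUP A ALL) GENERATE COUNT(A) AS n;
    -- one row per key: (key, number of rows with that key)
    counts = FOREACH (GROUP A BY $0) GENERATE group, COUNT(A);
    -- long / long: integer division, the fractional part is dropped
    truncated = FOREACH counts GENERATE $0, $1 / total.n;
    -- casting the operands to double gives floating-point division
    fractions = FOREACH counts GENERATE $0, (double)$1 / (double)total.n;

Casting only one operand is enough for Pig to promote the division to double; casting both just makes the intent explicit.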

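Answer 1: (score: 1)

Another approach:

    test = LOAD 'test.txt' USING PigStorage(',') AS (one:chararray,two:int);
    -- per-key row counts: (key, count)
    B = GROUP test by $0;
    C = FOREACH B GENERATE group, COUNT(test.$0);
    -- overall row count: (all, total)
    D = GROUP test ALL;
    E = FOREACH D GENERATE group,COUNT(test.$0);
    -- attach the total to every per-key row: (key, count, all, total)
    F = CROSS C,E;
    -- cast before dividing so the division is not done on longs
    G = FOREACH F GENERATE $0,$1,$3,(double)$1/(double)$3;

Output:

    (a,3,5,0.6)
    (b,2,5,0.4)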

Answer 2: (score: 0)

For some reason, the following modification of @inquisitive-mind's suggestion works:

    total = FOREACH (GROUP A ALL) GENERATE COUNT(A);
    rows = FOREACH (GROUP A by $0) GENERATE group as colname, COUNT(A) as cnt;
    fractions = FOREACH rows GENERATE colname, cnt/(double)total.$0;
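
A likely explanation, offered as an assumption rather than something stated in the thread: inside FOREACH rows, the fields of the current row are referenced directly by name or position, whereas a relation-qualified reference such as total.$0 is a scalar projection, which Pig only evaluates when that relation holds exactly one row. total has a single row, so total.$0 is fine; rows has one row per key, so rows.$0 cannot be read as a scalar. The sketch below puts the pieces together. The schema (key, val) and the field names n and fraction are hypothetical, added here for illustration.

    -- hypothetical schema: first column is the key, second is an integer value
    A = LOAD 'my file' USING PigStorage(',') AS (key:chararray, val:int);
    -- single-row relation with the overall row count
    total = FOREACH (GROUP A ALL) GENERATE COUNT(A) AS n;
    -- one row per key with its row count
    rows = FOREACH (GROUP A BY key) GENERATE group AS colname, COUNT(A) AS cnt;
    -- colname and cnt are fields of the current row of rows;
    -- total.n is a scalar read from the single-row relation total
    fractions = FOREACH rows GENERATE colname, (double)cnt / (double)total.n AS fraction;
    DUMP fractions;

Naming the fields with AS keeps the final FOREACH readable and avoids relying on positional references.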