I am trying to use Apache Pig to compute the fraction of rows that have a given property.
For example, if the data looks like this:
a,15
a,16
a,17
b,3
b,16
I would like to get:
a,0.6
b,0.4
This is what I am trying:
A = LOAD 'my file' USING PigStorage(',');
total = FOREACH (GROUP A ALL) GENERATE COUNT(A);
This gives me total = (5), but when I try to use this 'total':
fractions = FOREACH (GROUP A by $0) GENERATE COUNT(A)/total;
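I get an error.
Evidently COUNT() returns some kind of projection, and the two projections (the one that computes the total and the one that computes the fractions) have to be consistent. Is there a way to make this work? Or perhaps a way to just cast total to a number and avoid this projection consistency requirement?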
Answer 0 (score: 1)
You have to project it and cast it to double:
total = FOREACH (GROUP A ALL) GENERATE COUNT(A);
rows = FOREACH (GROUP A by $0) GENERATE group,COUNT(A);
fractions = FOREACH rows GENERATE rows.$0,(double)rows.$1/(double)total.$0;
Answer 1 (score: 1)
Another approach:
test = LOAD 'test.txt' USING PigStorage(',') AS (one:chararray,two:int);
B = GROUP test by $0;
C = FOREACH B GENERATE group, COUNT(test.$0);
D = GROUP test ALL;
E = FOREACH D GENERATE group,COUNT(test.$0);
F = CROSS C,E;
G = FOREACH F GENERATE $0,$1,$3,(double)$1/(double)$3;
Output:
(a,3,5,0.6)
(b,2,5,0.4)
Answer 2 (score: 0)
For some reason, the following modification of what @inquisitive-mind suggested works:
total = FOREACH (GROUP A ALL) GENERATE COUNT(A);
rows = FOREACH (GROUP A by $0) GENERATE group as colname, COUNT(A) as cnt;
fractions = FOREACH rows GENERATE colname, cnt/(double)total.$0;
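A likely explanation for why the aliases matter: inside FOREACH rows, an expression such as rows.$0 is read as a scalar projection of the whole relation rows, and a scalar projection is only valid when the relation has exactly one row. total has one row, so total.$0 is fine; rows has one row per key, so its fields have to be referenced directly (by name or by position). Below is a minimal end-to-end sketch under that assumption. The field names label, value, n and cnt are made up for illustration, and the file path is the one from the question.

```pig
-- Load the two-column CSV; the schema names are illustrative assumptions.
A = LOAD 'my file' USING PigStorage(',') AS (label:chararray, value:int);

-- One-row relation holding the total number of rows.
total = FOREACH (GROUP A ALL) GENERATE COUNT(A) AS n;

-- One row per label, with its count.
rows = FOREACH (GROUP A BY label) GENERATE group AS label, COUNT(A) AS cnt;

-- label and cnt are fields of the current tuple of rows;
-- total.n is a scalar projection, valid because total has exactly one row.
-- Both operands are cast to double so the division is not an integer division.
fractions = FOREACH rows GENERATE label, (double)cnt / (double)total.n;

DUMP fractions;
```

With the sample data from the question, this should produce the a/0.6 and b/0.4 fractions asked for.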