Counting field values in Pig

Date: 2017-03-08 14:31:04

Tags: apache-pig

I have the following test data.

A   B   C
M   O
M   M   M
M   M   M
N       O
P       N

I want to get the total number of entries in this sample test data, i.e. 12.

I have the following code to do that, but my result is incorrect. Any help on how to correct it would be appreciated.

test = LOAD 'testdata' USING PigStorage(',') AS (A:chararray, B:chararray, C:chararray);
values = FOREACH test GENERATE (A==''?'null':(A is null?'null':A)) AS A, (B==''?'null':(B is null?'null':B)) AS B, (C==''?'null':(C is null?'null':C)) AS C;
grp = GROUP values ALL;
counting = FOREACH grp GENERATE group, COUNT(values.A)+COUNT(values.B)+COUNT(values.C);

The result is 15 instead of 12.

I would also like to get the count of each value, e.g. M = 7, N = 2, O = 2, P = 1. I wrote the code below.

test = LOAD 'testdata' USING PigStorage(',') AS (A:chararray, B:chararray, C:chararray);
values = FOREACH test GENERATE (A==''?'null':(A is null?'null':A)) AS A, (B==''?'null':(B is null?'null':B)) AS B, (C==''?'null':(C is null?'null':C)) AS C;
grp = GROUP values ALL;
A = FOREACH grp {
    B = FILTER test.A=='M' OR test.B=='M' OR test.C=='M';
    GENERATE group, COUNT(B);
};

I get the error "Scalar has more than one row in the output".

1 answer:

Answer 0: (score: 2)

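You are including the column names (the header row) in the final count. Modify the script to ignore the first row, then group and count.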

test = LOAD 'testdata' USING PigStorage(',') AS (A:chararray, B:chararray, C:chararray);

-- RANK prepends a sequential number to each tuple; the header row gets rank 1.
ranked = RANK test;
test1 = FILTER ranked BY ($0 > 1); -- Note: $0 is the rank column; referring to it by name as rank_test should also work.

values = FOREACH test1 GENERATE (A==''?'null':(A is null?'null':A)) AS A, (B==''?'null':(B is null?'null':B)) AS B, (C==''?'null':(C is null?'null':C)) AS C;
grp = GROUP values ALL;
counting = FOREACH grp GENERATE group, COUNT(values.A)+COUNT(values.B)+COUNT(values.C);
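
For the second part of the question (a count per value), one possible approach is to turn the three columns into a single column and group on it. The following is a minimal sketch, not a tested solution: it assumes the header row is the first line of 'testdata', that fields carry no stray whitespace, and that empty cells load as null or as an empty string. The aliases body, cells, non_empty, by_val, per_value, all_vals and total are illustrative names that do not appear in the original scripts. It also avoids expressions like test.A inside the nested FOREACH; Pig appears to treat a relation referenced that way as a scalar, which would explain the "Scalar has more than one row in the output" error.

```pig
-- Sketch only: 'testdata' and the columns A, B, C come from the question;
-- every other alias below is an illustrative name.
test = LOAD 'testdata' USING PigStorage(',') AS (A:chararray, B:chararray, C:chararray);

-- Drop the header row, using the same RANK approach as above.
ranked = RANK test;
body = FILTER ranked BY rank_test > 1;

-- Put the three columns of each row into a bag, then flatten it,
-- so that every cell becomes its own one-field row.
cells = FOREACH body GENERATE FLATTEN(TOBAG(A, B, C)) AS val;

-- Keep only the cells that actually hold a value.
non_empty = FILTER cells BY val IS NOT NULL AND val != '';

-- Count per distinct value.
by_val = GROUP non_empty BY val;
per_value = FOREACH by_val GENERATE group AS val, COUNT(non_empty) AS n;

-- Total number of non-empty cells.
all_vals = GROUP non_empty ALL;
total = FOREACH all_vals GENERATE COUNT(non_empty) AS n;

DUMP per_value;
DUMP total;
```

Under those assumptions, per_value should hold one row per distinct value and total a single row with the overall count, which can be compared with the figures expected in the question (M = 7, N = 2, O = 2, P = 1, and 12 in total).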