I have the following test data.
A B C
M O
M M M
M M M
N O
P N
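I want to get the total number of entries in this sample test data, i.e. 12.
I have the following code to do that, but my result is incorrect.
Any help on how to correct it would be appreciated.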
test= LOAD 'testdata' USING PigStorage(',') as (A:chararray,B:chararray,C:chararray);
values = FOREACH test GENERATE (A==''?'null':(A is null?'null':A)) as A,(B==''?'null':(B is null?'null':B)) as B,(C==''?'null':(C is null?'null':C)) as C;
grp = GROUP values ALL;
counting = FOREACH grp GENERATE group, COUNT(values.A)+COUNT(values.B)+COUNT(values.C);
The result is 15 instead of 12.
I also want the count of each value, e.g. M = 7, N = 2, O = 2, P = 1. I wrote the code below.
test= LOAD 'testdata' USING PigStorage(',') as (A:chararray,B:chararray,C:chararray);
values = FOREACH test GENERATE (A==''?'null':(A is null?'null':A)) as A,(B==''?'null':(B is null?'null':B)) as B,(C==''?'null':(C is null?'null':C)) as C;
grp = GROUP values ALL;
A = FOREACH grp {
B =FILTER test.A=='M' OR test.B=='M' OR test.C=='M';
GENERATE group, COUNT(B);
};
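I get the error "Scalar has more than one row in the output".
Answer 0 (score: 2)
You are including the column names (the header row) in the final count. Modify the script to skip the first line, then group and count.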
test= LOAD 'testdata' USING PigStorage(',') as (A:chararray,B:chararray,C:chararray);
ranked = rank test;
test1 = FILTER ranked BY ($0 > 1); -- Drop the header row (rank 1). Note: rank_test should also work in place of $0.
values = FOREACH test1 GENERATE (A==''?'null':(A is null?'null':A)) as A,(B==''?'null':(B is null?'null':B)) as B,(C==''?'null':(C is null?'null':C)) as C;
grp = GROUP values ALL;
counting = FOREACH grp GENERATE group, COUNT(values.A)+COUNT(values.B)+COUNT(values.C);
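For the per-value counts (M = 7, N = 2, O = 2, P = 1), here is a rough sketch of one possible approach. It is not part of the original script. The nested FILTER in the question refers to test.A, which Pig most likely treats as a scalar projection of a whole relation, hence the "Scalar has more than one row" error. The sketch avoids that by turning the three columns into one column of cells and grouping on it. It assumes test1 is the header-free relation from the script above; the aliases cells, non_empty, by_value and value_counts, and the field name val, are made up for illustration.

```pig
-- Assumes test1 (header row already removed) from the script above.
-- Put the three columns of each row into a bag and flatten it,
-- so every cell becomes its own single-field row.
cells = FOREACH test1 GENERATE FLATTEN(TOBAG(A, B, C)) AS val:chararray;

-- Discard missing cells, whether they were loaded as null or as ''.
non_empty = FILTER cells BY val IS NOT NULL AND val != '';

-- Group identical values together and count the rows in each group.
by_value = GROUP non_empty BY val;
value_counts = FOREACH by_value GENERATE group AS val, COUNT(non_empty) AS cnt;

DUMP value_counts;
```

If this behaves as intended, grouping non_empty with GROUP non_empty ALL and applying COUNT(non_empty) would give the overall total as well, without adding the three column counts together.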