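I'm processing a log file of HTTP traffic entries. I'm trying to determine the number of records per status code for each hour of the day. My ideal output would look like this: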
0 (200, 234) (201, 100) (404, 5553)
1 (200, 2234) (201, 1100) (404, 53)
....
I have the following transformations:
e1 = group LINES BY (hour, statusCode);
e2 = foreach e1 generate group.hour, group.statusCode, COUNT(LINES);
e3 = group e2 by hour;
e4 = foreach e3 {
statusCount = foreach e2 generate statusCode, $2;
generate e3.group, statusCount;
};
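When I try "dump e4", I get the following error message:
Scalar has more than one row in the output. 1st : (0,{(0,000,1),(0,200,951),(0,206,1),(0,302,4),(0,304,20),(0,403,118),(0,500,6)}), 2nd :(1,{(1,200,781),(1,301,1),(1,304,14),(1,400,1),(1,403,111),(1,502,12)})
As you can see, the values are there; I just need to save them... but how? I tried doing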
e5 = foreach e4 generate group, statusCount;
but I got the same output. I know I'm missing something basic, but I can't figure out what...
-
Answer 0 (score: 1)
You can solve this easily; the challenge is producing the output format you mentioned.
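For reference, the "Scalar has more than one row in the output" error most likely comes from e3.group inside the nested block. Writing a relation alias followed by a field name makes Pig treat e3 as a single-row scalar, and e3 has one row per hour. Below is a minimal sketch of the original script with that reference changed to plain group. It assumes LINES already has hour and statusCode fields and that your Pig version supports nested FOREACH (0.10 or later, as far as I know). The alias cnt is a name added here for illustration.

```pig
e1 = GROUP LINES BY (hour, statusCode);
-- cnt is an illustrative name for the count column ($2 in the original script)
e2 = FOREACH e1 GENERATE group.hour AS hour, group.statusCode AS statusCode, COUNT(LINES) AS cnt;
e3 = GROUP e2 BY hour;
e4 = FOREACH e3 {
    -- inside the nested block, e2 is the bag of rows for the current hour
    statusCount = FOREACH e2 GENERATE statusCode, cnt;
    -- use group (the key of the current row), not e3.group
    GENERATE group AS hour, statusCount;
};
DUMP e4;
```

This should give one row per hour holding a bag of (statusCode, cnt) tuples, which is the bag-per-hour shape that standard Pig produces.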
Option 1:
With standard Pig, you will always get the following output format (i.e. the results for each hour are wrapped in a bag).
PigScript:
A = LOAD 'input' USING PigStorage() AS (hour:int, statusCode:chararray);
B = GROUP A BY (hour,statusCode);
C = FOREACH B GENERATE FLATTEN(group) AS (hour,statusCode),COUNT($1) AS cnt;
D = GROUP C BY hour;
E = FOREACH D GENERATE group,C.(statusCode,cnt);
STORE E INTO 'output' USING PigStorage();
Output:
0 {(302,2),(304,3),(403,1),(500,1)}
1 {(200,1),(301,1),(304,2),(400,1),(403,1),(502,5)}
Option 2:
If you want exactly the output format you mentioned, you have to use the BagToTuple UDF from piggybank.jar.
Download the jar file from this link http://www.java2s.com/Code/Jar/p/Downloadpiggybankjar.htm and try the approach below.
PigScript:
REGISTER '/tmp/piggybank.jar';
A = LOAD 'input' USING PigStorage() AS (hour:int, statusCode:chararray);
B = GROUP A BY (hour,statusCode);
C = FOREACH B GENERATE FLATTEN(group) AS (hour,statusCode),COUNT($1) AS cnt;
D = GROUP C BY hour;
E = FOREACH D {
mytuple = FOREACH C GENERATE TOTUPLE(statusCode,cnt);
GENERATE group,FLATTEN(BagToTuple(mytuple));
}
STORE E INTO 'output1' USING PigStorage();
Output:
0 (302,2) (304,3) (403,1) (500,1)
1 (200,1) (301,1) (304,2) (400,1) (403,1) (502,5)
Sample input passed to the scripts:
Input:
0 302
0 302
0 304
0 304
0 304
0 403
0 500
1 200
1 301
1 304
1 304
1 400
1 403
1 502
1 502
1 502
1 502
1 502