Pig: how to fix "Scalar has more than one row in the output"

Time: 2015-01-15 18:57:21

Tags: hadoop apache-pig

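I am processing log files of HTTP traffic entries, and I am trying to determine the number of records per status code for each hour of the day. My ideal output would look like this: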

0 (200, 234) (201, 100) (404, 5553)
1 (200, 2234) (201, 1100) (404, 53)
....

I have the following transformations:

e1 = group LINES BY (hour, statusCode);
e2 = foreach e1 generate group.hour, group.statusCode, COUNT(LINES);
e3 = group e2 by hour;
e4 = foreach e3 {
    statusCount = foreach e2 generate statusCode, $2;
    generate e3.group, statusCount;
};

When I try to "dump e4", I get the following error message:

  

Scalar has more than one row in the output. 1st : (0,{(0,000,1),(0,200,951),(0,206,1),(0,302,4),(0,304,20),(0,403,118),(0,500,6)}), 2nd :(1,{(1,200,781),(1,301,1),(1,304,14),(1,400,1),(1,403,111),(1,502,12)})

As you can see, the values are there; I just need to save them... but how? I tried doing

e5 = foreach e4 generate group, statusCount;

but I got the same output. I know I am missing something basic, but I can't figure out what...


1 answer:

Answer 0 (score: 1)

You can solve this problem easily; the challenge is getting the exact output format you mentioned.
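On why the error appears in the first place: as far as I can tell, inside the nested block `e3.group` is read as a scalar projection of the whole relation `e3`, and Pig only allows that when the relation has exactly one row. Referring to the grouping key simply as `group` avoids it. A minimal sketch of the original script with that one change follows; the `AS` aliases (`hour`, `statusCode`, `cnt`) are my additions for readability, and `LINES` is assumed to be loaded as in the question.

```pig
-- Assumes LINES already has the fields hour and statusCode, as in the question.
e1 = GROUP LINES BY (hour, statusCode);
-- Aliases below are illustrative additions, not part of the original script.
e2 = FOREACH e1 GENERATE group.hour AS hour, group.statusCode AS statusCode, COUNT(LINES) AS cnt;
e3 = GROUP e2 BY hour;
e4 = FOREACH e3 {
    statusCount = FOREACH e2 GENERATE statusCode, cnt;
    -- Use the grouping key "group" directly instead of the scalar "e3.group".
    GENERATE group, statusCount;
};
DUMP e4;
```

This should give one row per hour with a bag of (statusCode, cnt) tuples, which is the same shape as Option 1 below.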

Option 1:
With stock Pig, you will always get the output format below (i.e. your output will be wrapped in a bag).

PigScript:

A = LOAD 'input' USING PigStorage() AS (hour:int, statusCode:chararray);
B = GROUP A BY (hour,statusCode);
C = FOREACH B GENERATE FLATTEN(group) AS (hour,statusCode),COUNT($1) AS cnt;
D = GROUP C BY hour;
E = FOREACH D GENERATE group,C.(statusCode,cnt);
STORE E INTO 'output' USING PigStorage();

Output:

0   {(302,2),(304,3),(403,1),(500,1)}
1   {(200,1),(301,1),(304,2),(400,1),(403,1),(502,5)}

Option 2:
If you want the output format you mentioned, you have to use the BagToTuple UDF from piggybank.jar. Download the jar file from this link, http://www.java2s.com/Code/Jar/p/Downloadpiggybankjar.htm, and try the approach below.

PigScript:

REGISTER '/tmp/piggybank.jar';
A = LOAD 'input' USING PigStorage() AS (hour:int, statusCode:chararray);
B = GROUP A BY (hour,statusCode);
C = FOREACH B GENERATE FLATTEN(group) AS (hour,statusCode),COUNT($1) AS cnt;
D = GROUP C BY hour;
E = FOREACH D {
                   mytuple = FOREACH C GENERATE TOTUPLE(statusCode,cnt);
                   GENERATE group,FLATTEN(BagToTuple(mytuple));
              }
STORE E INTO 'output1' USING PigStorage();

Output:

0   (302,2) (304,3) (403,1) (500,1)
1   (200,1) (301,1) (304,2) (400,1) (403,1) (502,5)

Sample input passed to the script:

Input:

0       302
0       302
0       304
0       304
0       304
0       403
0       500
1       200
1       301
1       304
1       304
1       400
1       403
1       502
1       502
1       502
1       502
1       502