猪:REPLACE的问题

时间:2016-01-05 07:33:03

标签: replace foreach apache-pig

以下是我的数据:

199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245

unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985

199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] "GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0" 200 4085

burger.letters.com - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/countdown/liftoff.html HTTP/1.0" 304 0

以下是Pig代码:

loadFulldata = LOAD '/root/Kennadi-Project/Kennadi-data.txt' USING PigStorage(',') AS (fullline:chararray);

extractData = FOREACH loadFulldata GENERATE FLATTEN (REGEX_EXTRACT_ALL(fullline,'(.*) - - (.*) -(.*)] "(.*)" (.*) (.*)'));

rowdata = FOREACH extractData GENERATE $0 as host,$1 as datetime,$2 as timezone,$3 as responseurl,$4 as responsecode,$5 as response data;

我的extractData看起来像:

(199.72.81.55,[01/Jul/1995:00:00:01,0400,GET /history/apollo/ HTTP/1.0,200,6245)
(unicomp6.unicomp.net,[01/Jul/1995:00:00:06,0400,GET /shuttle/countdown/ HTTP/1.0,200,3985)
(199.120.110.21,[01/Jul/1995:00:00:09,0400,GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0,200,4085)
(burger.letters.com,[01/Jul/1995:00:00:11,0400,GET /shuttle/countdown/liftoff.html HTTP/1.0,304,0)
(199.120.110.21,[01/Jul/1995:00:00:11,0400,GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0,200,4179)
(burger.letters.com,[01/Jul/1995:00:00:12,0400,GET /images/NASA-logosmall.gif HTTP/1.0,304,0)

当我使用REGEX_EXTRACT_ALL时,我无法从数据中删除'[',我该如何实现?

此外,我尝试删除'['使用REPLACE函数,如下所示:

rowdata = FOREACH extractData GENERATE $0 as host,$1 as datadatetime,$2 as timezone,$3 as responseurl,$4 as responsecode,$5 as response data;

newdata = FOREACH rowdata GENERATE REPLACE(datadatetime,'[','');

但我收到了警告:

2016-01-05 05:10:13,758 [main] WARN  org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning USING_OVERLOADED_FUNCTION 1 time(s).
2016-01-05 05:10:13,758 [main] WARN  org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_CHARARRAY 1 time(s).

我认为这是因为我没有为datadatetime定义任何数据类型,如何在foreach中定义数据类型?

1 个答案:

答案 0 :(得分:0)

你有问题。您尝试使用正则表达式解决它。你现在有两个问题。

严重的是,在尝试之后,这似乎只是正则表达式的一个问题。使用

REGEX_EXTRACT_ALL(fullline,'(.*) - - \\[(.*) -(.*)\\] "(.*)" (.*) (.*)')

为我做了诀窍。

结果:

(199.72.81.55,01/Jul/1995:00:00:01,0400,GET /history/apollo/ HTTP/1.0,200,6245)
(unicomp6.unicomp.net,01/Jul/1995:00:00:06,0400,GET /shuttle/countdown/ HTTP/1.0,200,3985)
(199.120.110.21,01/Jul/1995:00:00:09,0400,GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0,200,4085)
(burger.letters.com,01/Jul/1995:00:00:11,0400,GET /shuttle/countdown/liftoff.html HTTP/1.0,304,0)