I am using the Cloudera QuickStart VM 5.4. I have a file in which each line contains data like this:
323.81.303.680 - - [25/Oct/2011:01:41:00 -0500] "GET /download/download6.zip HTTP/1.1" 200 0 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.19) Gecko/2010031422 Firefox/3.0.19"
In Apache Pig, I am using the following script:
A= LOAD 'weblog.txt' using TextLoader() as (line:chararray);
B= FOREACH A GENERATE
FLATTEN(REGEX_EXTRACT_ALL(line,'^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] “(.+?)” (\\S+) (\\S+) “([^”]*)” “([^”]*)”')) AS (remoteAddr: chararray, remoteLogname: chararray, user: chararray, time:chararray, request: chararray, status:int,bytes_string:chararray,referrer: chararray, browser: chararray);
DUMP B;
The query above produces output like this:
()
()
Can anyone tell me what I am doing wrong? Is it the regular expression?
Answer 0 (score: 0)
Add ", line" after the closing "chararray)" of the AS clause and before the ";":
A= LOAD 'weblog.txt' using TextLoader() as (line:chararray);
B= FOREACH A GENERATE FLATTEN(
REGEX_EXTRACT_ALL(line,'^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"'))
AS (remoteAddr: chararray, remoteLogname: chararray, user: chararray, time:chararray, request: chararray, status:int,bytes_string:chararray,referrer: chararray, browser: chararray)
, line;
DUMP B;
As for the regular expression, it matches the sample string fine; see the regex demo.
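
To check the pattern outside Pig, here is a minimal Java sketch. It rests on two assumptions: that REGEX_EXTRACT_ALL is built on java.util.regex and, by default, only returns fields when the whole line matches; and that the sample line has the usual access-log spacing. The class and method names (WeblogRegexCheck, extractAll) are made up for this illustration. It also tries a variant of the pattern written with typographic quotes, as in the script posted in the question. If those quotes are really in the script file and are not just a copy-and-paste artifact, that alone could explain the empty tuples.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WeblogRegexCheck {
    // Sample line from the question, with conventional access-log spacing (assumed).
    static final String LINE =
        "323.81.303.680 - - [25/Oct/2011:01:41:00 -0500] "
        + "\"GET /download/download6.zip HTTP/1.1\" 200 0 \"-\" "
        + "\"Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.19) "
        + "Gecko/2010031422 Firefox/3.0.19\"";

    // Pattern from the answer: straight ASCII double quotes.
    static final String STRAIGHT =
        "^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "
        + "\"(.+?)\" (\\S+) (\\S+) \"([^\"]*)\" \"([^\"]*)\"";

    // Same pattern with typographic quotes (U+201C / U+201D), as in the question.
    static final String CURLY =
        "^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "
        + "\u201C(.+?)\u201D (\\S+) (\\S+) \u201C([^\u201D]*)\u201D \u201C([^\u201D]*)\u201D";

    // Returns the captured groups if the whole line matches, otherwise null.
    static String[] extractAll(String regex, String line) {
        Matcher m = Pattern.compile(regex).matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] fields = new String[m.groupCount()];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = m.group(i + 1);
        }
        return fields;
    }

    public static void main(String[] args) {
        String[] fields = extractAll(STRAIGHT, LINE);
        System.out.println("straight quotes matched: " + (fields != null));
        if (fields != null) {
            for (String f : fields) {
                System.out.println("  " + f);
            }
        }
        System.out.println("curly quotes matched: " + (extractAll(CURLY, LINE) != null));
    }
}
```

If the straight-quote pattern matches here but Pig still returns empty tuples, compare the quote characters and the spacing in the actual script and input file with what is shown above.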