Using REGEX_EXTRACT_ALL, but in the projection I get "()"

Time: 2015-11-03 12:26:01

Tags: regex hadoop apache-pig

I am using Cloudera QuickStart 5.4. I have a file in which each line contains data like:

  

323.81.303.680 - - [25/Oct/2011:01:41:00 -0500] "GET /download/download6.zip HTTP/1.1" 200 0 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; EN-US; rv:1.9.0.19) Gecko/2010031422 Firefox/3.0.19"

In Apache Pig, the script I am using is:

A= LOAD 'weblog.txt' using TextLoader() as (line:chararray);
B= FOREACH A GENERATE 
FLATTEN(REGEX_EXTRACT_ALL(line,'^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] “(.+?)” (\\S+) (\\S+) “([^”]*)” “([^”]*)”')) AS (remoteAddr: chararray, remoteLogname: chararray, user: chararray, time:chararray, request: chararray, status:int,bytes_string:chararray,referrer: chararray, browser: chararray);

DUMP B;

The query above produces output like:

()
()

Can anyone tell me what I am doing wrong? Is it the regular expression?

1 answer:

Answer 0: (score: 0)

Add ; after chararray and before , line:

A= LOAD 'weblog.txt' using TextLoader() as (line:chararray);
B= FOREACH A GENERATE FLATTEN(
    REGEX_EXTRACT_ALL(line,'^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"')) 
     AS (remoteAddr: chararray, remoteLogname: chararray, user: chararray, time:chararray, request: chararray, status:int,bytes_string:chararray,referrer: chararray, browser: chararray)
       , line;

DUMP B;

As for the regular expression, it matches the sample string fine; see the regex demo.
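For anyone who wants to check the pattern without running Pig, here is a minimal sketch in Python. It is only a stand-in: Pig evaluates the pattern with Java's regex engine, and I am assuming the two behave the same for this particular expression. The sample line has its spacing normalized (the spaces around "/" in the post look like a formatting artifact), and the names LINE, STRAIGHT, CURLY and extract are made up for the illustration. It also tries the pattern with the curly quotes that appear in the question's script as shown here; whether those were in the real script or crept in through copy/paste is unclear.

```python
import re

# Sample log line with normalized spacing (assumption: the extra spaces
# in the post are a formatting artifact, not part of the real file).
LINE = ('323.81.303.680 - - [25/Oct/2011:01:41:00 -0500] '
        '"GET /download/download6.zip HTTP/1.1" 200 0 "-" '
        '"Mozilla/5.0 (Windows; U; Windows NT 5.1; EN-US; rv:1.9.0.19) '
        'Gecko/2010031422 Firefox/3.0.19"')

# Pattern from the answer, with straight double quotes. The Pig script
# doubles each backslash; a Python raw string needs only one.
STRAIGHT = re.compile(
    r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] '
    r'"(.+?)" (\S+) (\S+) "([^"]*)" "([^"]*)"')

# Same pattern with the curly quotes seen in the question's script.
CURLY = re.compile(
    r'^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] '
    r'“(.+?)” (\S+) (\S+) “([^”]*)” “([^”]*)”')


def extract(pattern, line):
    """Return the captured groups if the whole line matches, else None."""
    # fullmatch requires the entire line to match, the stricter of the
    # two options; Pig's exact matching mode is not verified here.
    m = pattern.fullmatch(line)
    return m.groups() if m else None


if __name__ == '__main__':
    print(extract(STRAIGHT, LINE))
    print(extract(CURLY, LINE))
```

With the straight-quote pattern all nine fields are captured from the sample line. The curly-quote variant finds no match, because the log line itself contains only straight quotes; a pattern that matches nothing would be consistent with the empty () rows, though this sketch does not confirm how Pig itself reacts.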