I have a log file:
116.48.29.143 - - [01/Oct/2013:20:28:21 +0530] "GET /test.php HTTP/1.1" 200 749 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
145.89.87.211 - - [01/Oct/2013:20:28:21 +0530] "GET /test.php HTTP/1.1" 200 613 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
REGISTER file:/home/hadoop/lib/pig/piggybank.jar;
DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();
DEFINE CustomFormatToISO org.apache.pig.piggybank.evaluation.datetime.convert.CustomFormatToISO();
DEFINE ISOToUnix org.apache.pig.piggybank.evaluation.datetime.convert.ISOToUnix();
DEFINE DATE_TIME org.apache.pig.piggybank.evaluation.datetime.DATE_TIME();
DEFINE FORMAT_DT org.apache.pig.piggybank.evaluation.datetime.FORMAT_DT();
DEFINE FORMAT org.apache.pig.piggybank.evaluation.string.FORMAT();
A = LOAD 'input' USING TextLoader AS (line:chararray);
B = FOREACH A GENERATE FLATTEN (REGEX_EXTRACT_ALL(line,'^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"') ) AS (remoteAddr:chararray, remoteLogname: chararray, user: chararray, time: chararray, request: chararray, status: int, bytes_string: chararray, referrer: chararray, browser: chararray);
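My question is about extracting the minutes. That is, from [01/Oct/2013:20:28:21 +0530] I need to get only 28:21.
How can I extract it?
Answer 0 (score: 0)
You already know how to write a regular expression, so why not modify yours or write a new one? Here is a new one: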
-- time is a field of relation B; the third argument selects capture group 1 (mm:ss)
C = FOREACH B GENERATE REGEX_EXTRACT(time, '[\\w/]+:\\d{2}:(\\d{2}:\\d{2})\\s[+\\-]\\d{4}', 1) AS minSec;
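If the timestamp always has the same fixed-width layout, a regex is not strictly needed. Below is a minimal sketch that slices the field with Pig's built-in SUBSTRING instead. It assumes relation B and its time field from the question, and that time always looks like 01/Oct/2013:20:28:21 +0530. The aliases D, minSec, minutePart and secondPart are made up for illustration.

```pig
-- Sketch: fixed-offset slicing of the time field parsed into relation B.
-- Assumption: time is always 'dd/MMM/yyyy:HH:mm:ss Z', e.g. '01/Oct/2013:20:28:21 +0530'.
-- SUBSTRING(string, startIndex, stopIndex) is zero-based and stopIndex is exclusive.
D = FOREACH B GENERATE
        remoteAddr,
        SUBSTRING(time, 15, 20) AS minSec,      -- '28:21'
        SUBSTRING(time, 15, 17) AS minutePart,  -- '28'
        SUBSTRING(time, 18, 20) AS secondPart;  -- '21'
DUMP D;
```

The offsets break as soon as the day or year is not zero-padded to the same width, so the regex is the safer choice if the log format might vary.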
Answer 1 (score: 0)
I wrote a dedicated Pig Loader for this kind of application. It makes parsing Apache HTTPD log files easier.
You can use it in your Pig application like this:
REGISTER httpdlog-pigloader-1.0-SNAPSHOT-job.jar
Clicks =
    LOAD 'access_log.gz'
    USING nl.basjes.pig.input.apachehttpdlog.Loader(
        '%h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i"',
        'IP:connection.client.host',
        'TIME.MINUTE:request.receive.time.minute',
        'HTTP.URI:request.firstline.uri',
        'STRING:request.firstline.uri.query.foo',
        'STRING:request.status.last',
        'HTTP.URI:request.referer',
        'STRING:request.referer.query.foo',
        'HTTP.USERAGENT:request.user-agent')
    AS (
        ConnectionClientHost,
        RequestReceiveTimeMinute,
        RequestFirstlineUri,
        RequestFirstlineUriQueryFoo,
        RequestStatusLast,
        RequestReferer,
        RequestRefererQueryFoo,
        RequestUseragent);