Parsing Nginx logs with Hive using a SerDe

Date: 2014-10-16 07:19:53

Tags: regex hadoop nginx hive

I am currently parsing a custom nginx log with the following Hive script:

add jar s3://my-bucket-foo/hive-serde-0.13.1.jar;
SET hive.mapred.supports.subdirectories=true;
SET mapred.input.dir.recursive=true;
set hive.exec.compress.intermediate=true;
set mapred.compress.map.output=true;
set hive.exec.parallel=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec;

DROP TABLE nginx_logs ;
CREATE EXTERNAL TABLE nginx_logs (
IP STRING,
`Timestamp` STRING,
Verb STRING,
URL STRING,
HTTPVersion STRING,
RequestProcessingTime STRING,
ReceivedBytes STRING,
URLReferer STRING,
UserAgent STRING,
MSISDN STRING,
XCALL STRING,
ResponseCode STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (

"input.regex" = "(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})\\s+-\\s+-\\s+\\[(\\d{2}\\/[a-z]{3}\\/\\d{4}:\\d{2}:\\d{2}:\\d{2}\\s+)-\\d{4}\\]\\s+\"(GET)(.+)(http\\/1\\.1\")\\s+(\\d{1,}\\.\\d{3})\\s+(\\d+)\\s+\"([^\"]+)\"\\s+agent\\[\"([^\"]+)\"\\]\\s+-\\s+\\.\\s+msisdn\\[([^\\]]+)\\]\\s+xcall\\[([^\\]]+)\\]\\s+(\\d{1,}).*"
    )

LOCATION 's3n://my-bucket/EMRInput/';

Some log lines and a live example are available at http://regex101.com/r/tW8yT5/1. Sample line:

192.168.0.143 - - [25/Sep/2014:19:17:40 -0300]  "GET /adserver/www/delivery/lg.php?bannerid=4512&campaignid=374&zoneid=40&loc=1&cb=2b674aefb7 HTTP/1.1" 0.000  43 "http://wap.tim.com.br/html5/" Agent["Mozilla/5.0 (Linux; U; Android 4.1.2; pt-br; LG-E467f Build/JZO54K) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30"] - . msisdn[-] xcall[552199999955] 200

According to regex101, the pattern has 12 matching groups.

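But with my first version of the pattern, in which each backslash was written only once, every query such as

     select * from nginx_logs limit 10;

failed with an error saying that the number of matching groups does not match the number of columns: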

hive> select * from nginx_logs limit 10;
OK
Failed with exception java.io.IOException:org.apache.hadoop.hive.serde2.SerDeException: Number of matching groups doesn't match the number of columns
Time taken: 0.036 seconds
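
The error concerns the group count of the pattern the SerDe actually compiled, which is not necessarily the count regex101 reports. Below is a minimal Java sketch of that kind of check. It assumes RegexSerDe compares `Matcher.groupCount()` with the number of table columns, which is what the error message suggests. The class and constant names are hypothetical. The pattern is the double-escaped `input.regex` value from the script above, copied as a Java string literal.

```java
import java.util.regex.Pattern;

public class GroupCountCheck {
    // Double-escaped pattern from the CREATE TABLE statement (hypothetical constant name).
    static final String INPUT_REGEX =
        "(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})\\s+-\\s+-\\s+"
        + "\\[(\\d{2}\\/[a-z]{3}\\/\\d{4}:\\d{2}:\\d{2}:\\d{2}\\s+)-\\d{4}\\]"
        + "\\s+\"(GET)(.+)(http\\/1\\.1\")\\s+(\\d{1,}\\.\\d{3})\\s+(\\d+)\\s+\"([^\"]+)\""
        + "\\s+agent\\[\"([^\"]+)\"\\]\\s+-\\s+\\.\\s+msisdn\\[([^\\]]+)\\]"
        + "\\s+xcall\\[([^\\]]+)\\]\\s+(\\d{1,}).*";

    // Number of columns declared in nginx_logs.
    static final int NUM_COLUMNS = 12;

    // Counts capturing groups without needing any input to match.
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        int groups = groupCount(INPUT_REGEX);
        System.out.println("groups=" + groups + ", columns=" + NUM_COLUMNS);
    }
}
```

If the count printed here equals the column count but Hive still reports a mismatch, the pattern text reaching the SerDe probably differs from what was typed, for example because Hive processed the backslash escapes in the string literal first.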

I then escaped every backslash (\) twice, and now instead of the error I get:

hive> select * from nginx_logs limit 1;
OK
NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL    NULL
Time taken: 0.037 seconds, Fetched: 1 row(s)

Any ideas about this?

1 Answer:

Answer 0 (score: 0)

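After learning how the SerDe and Hive handle regular expressions, I realized that my first matching group only allowed for a single IP address: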

(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})

This works in most cases, except for lines with two or more IPs (proxies, etc.), so a line such as

200.222.108.241, 200.222.108.241, 200.222.108.241 - - [04/Oct/2014:06:30:48 -0300]  "GET /wml/redirect/jogos.wml HTTP/1.1" 0.000  154 "-" Agent["SAMSUNG-GT-E2222L/1.0 NetFront/4.1 Profile/MIDP-2.0 Configuration/CLDC-1.1"] - . msisdn[-] xcall[-] 302

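does not match, which caused us a lot of trouble. The fix turned out to be quick: use a group that matches everything from the start of the line up to the first dash (-):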

([^-]*)\\s+-\\s+-\\s+\\[(\\d{2}\\/[a-zA-Z]{3}\\/\\d{4}:\\d{2}:\\d{2}:\\d{2}\\s+)-\\d{4}\\]\\s+\"(GET)(.+)(HTTP\\/1\\.1\")\\s+(\\d{1,}\\.\\d{3})\\s+(\\d+)\\s+\"([^\"]+)\"\\s+Agent\\[\"([^\"]+)\"\\]\\s+-\\s+\\.\\s+msisdn\\[([^\\]]+)\\]\\s+xcall\\[([^\\]]+)\\]\\s+(\\d{1,}).*

Tested in a Java main program, it works like a charm:

Matches? true

Group 1: 200.222.108.241, 200.222.108.241, 200.222.108.241
Group 2: 04/Oct/2014:06:30:48 
Group 3: GET
Group 4:  /wml/redirect/jogos.wml 
Group 5: HTTP/1.1"
Group 6: 0.000
Group 7: 154
Group 8: -
Group 9: SAMSUNG-GT-E2222L/1.0 NetFront/4.1 Profile/MIDP-2.0 Configuration/CLDC-1.1
Group 10: -
Group 11: -
Group 12: 302
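
The original test program is not shown. A minimal sketch of what such a Java main could look like follows. The class, constant, and method names are hypothetical, and it uses `Matcher.matches()` on the assumption that the SerDe requires the whole line to match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NginxRegexTest {
    // Pattern from the answer, written as a Java string literal.
    static final String REGEX =
        "([^-]*)\\s+-\\s+-\\s+"
        + "\\[(\\d{2}\\/[a-zA-Z]{3}\\/\\d{4}:\\d{2}:\\d{2}:\\d{2}\\s+)-\\d{4}\\]"
        + "\\s+\"(GET)(.+)(HTTP\\/1\\.1\")\\s+(\\d{1,}\\.\\d{3})\\s+(\\d+)\\s+\"([^\"]+)\""
        + "\\s+Agent\\[\"([^\"]+)\"\\]\\s+-\\s+\\.\\s+msisdn\\[([^\\]]+)\\]"
        + "\\s+xcall\\[([^\\]]+)\\]\\s+(\\d{1,}).*";

    // The multi-IP sample line from the answer.
    static final String SAMPLE =
        "200.222.108.241, 200.222.108.241, 200.222.108.241 - - [04/Oct/2014:06:30:48 -0300]  "
        + "\"GET /wml/redirect/jogos.wml HTTP/1.1\" 0.000  154 \"-\" "
        + "Agent[\"SAMSUNG-GT-E2222L/1.0 NetFront/4.1 Profile/MIDP-2.0 Configuration/CLDC-1.1\"] "
        + "- . msisdn[-] xcall[-] 302";

    // Returns the captured groups, or null when the whole line does not match.
    static String[] parse(String line) {
        Matcher m = Pattern.compile(REGEX).matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] groups = new String[m.groupCount()];
        for (int i = 0; i < groups.length; i++) {
            groups[i] = m.group(i + 1);
        }
        return groups;
    }

    public static void main(String[] args) {
        String[] groups = parse(SAMPLE);
        System.out.println("Matches? " + (groups != null));
        if (groups != null) {
            for (int i = 0; i < groups.length; i++) {
                System.out.println("Group " + (i + 1) + ": " + groups[i]);
            }
        }
    }
}
```

Because Java string literals use the same backslash escaping as the double-escaped Hive literal, the pattern can be pasted between the two without changes.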

And voilà, I am now able to query it.
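
Besides the wider first group, the working pattern also differs from the one in the question in letter case: `[a-zA-Z]`, `HTTP` and `Agent` instead of `[a-z]`, `http` and `agent`. `java.util.regex` is case-sensitive by default, and regex101's `/gmi` flags do not carry over to Hive. That would be consistent with the all-NULL rows, assuming RegexSerDe returns NULL columns for lines that do not match. The sketch below compares both patterns on both sample lines. The class and constant names are hypothetical.

```java
import java.util.regex.Pattern;

public class RegexComparison {
    // Tail shared by both patterns after the first group, lower-case variant (question).
    static final String QUESTION_REGEX =
        "(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})\\s+-\\s+-\\s+"
        + "\\[(\\d{2}\\/[a-z]{3}\\/\\d{4}:\\d{2}:\\d{2}:\\d{2}\\s+)-\\d{4}\\]"
        + "\\s+\"(GET)(.+)(http\\/1\\.1\")\\s+(\\d{1,}\\.\\d{3})\\s+(\\d+)\\s+\"([^\"]+)\""
        + "\\s+agent\\[\"([^\"]+)\"\\]\\s+-\\s+\\.\\s+msisdn\\[([^\\]]+)\\]"
        + "\\s+xcall\\[([^\\]]+)\\]\\s+(\\d{1,}).*";

    // Pattern from the answer: wider first group and case matching the log.
    static final String ANSWER_REGEX =
        "([^-]*)\\s+-\\s+-\\s+"
        + "\\[(\\d{2}\\/[a-zA-Z]{3}\\/\\d{4}:\\d{2}:\\d{2}:\\d{2}\\s+)-\\d{4}\\]"
        + "\\s+\"(GET)(.+)(HTTP\\/1\\.1\")\\s+(\\d{1,}\\.\\d{3})\\s+(\\d+)\\s+\"([^\"]+)\""
        + "\\s+Agent\\[\"([^\"]+)\"\\]\\s+-\\s+\\.\\s+msisdn\\[([^\\]]+)\\]"
        + "\\s+xcall\\[([^\\]]+)\\]\\s+(\\d{1,}).*";

    static final String LINE_SINGLE_IP =
        "192.168.0.143 - - [25/Sep/2014:19:17:40 -0300]  "
        + "\"GET /adserver/www/delivery/lg.php?bannerid=4512&campaignid=374&zoneid=40&loc=1&cb=2b674aefb7 HTTP/1.1\" "
        + "0.000  43 \"http://wap.tim.com.br/html5/\" "
        + "Agent[\"Mozilla/5.0 (Linux; U; Android 4.1.2; pt-br; LG-E467f Build/JZO54K) "
        + "AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30\"] "
        + "- . msisdn[-] xcall[552199999955] 200";

    static final String LINE_MULTI_IP =
        "200.222.108.241, 200.222.108.241, 200.222.108.241 - - [04/Oct/2014:06:30:48 -0300]  "
        + "\"GET /wml/redirect/jogos.wml HTTP/1.1\" 0.000  154 \"-\" "
        + "Agent[\"SAMSUNG-GT-E2222L/1.0 NetFront/4.1 Profile/MIDP-2.0 Configuration/CLDC-1.1\"] "
        + "- . msisdn[-] xcall[-] 302";

    // Whole-line match with the given java.util.regex flags (0 means none).
    static boolean matches(String regex, int flags, String line) {
        return Pattern.compile(regex, flags).matcher(line).matches();
    }

    public static void main(String[] args) {
        int ci = Pattern.CASE_INSENSITIVE;
        System.out.println("question, single IP:              " + matches(QUESTION_REGEX, 0, LINE_SINGLE_IP));
        System.out.println("question + CASE_INSENSITIVE:      " + matches(QUESTION_REGEX, ci, LINE_SINGLE_IP));
        System.out.println("question + CASE_INSENSITIVE, multi: " + matches(QUESTION_REGEX, ci, LINE_MULTI_IP));
        System.out.println("answer, single IP:                " + matches(ANSWER_REGEX, 0, LINE_SINGLE_IP));
        System.out.println("answer, multi IP:                 " + matches(ANSWER_REGEX, 0, LINE_MULTI_IP));
    }
}
```

On these two lines, the question's pattern only matches the single-IP line once case is ignored, and it never matches the multi-IP line; the answer's pattern matches both. RegexSerDe also appears to accept an `input.regex.case.insensitive` table property as an alternative to editing the pattern, but check the documentation for your Hive version before relying on it.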