How can I generate a regular expression that matches access.log from its format configuration?

Time: 2015-01-17 13:07:44

Tags: python regex apache logging nginx

The access.log format configuration might look like this:

'$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent"'

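Is there a way to generate a regular expression that matches access.log from this format? I can write a regular expression based on an actual log line: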

'112.3.194.120 - - [17/Jan/2015:20:07:34 +0800] "GET /Introdction%20to%20Guitar/1%20-%202%20-%20Choosing%20the%20Right%20Guitar-%20Right-Handed%20vs%20Left-Handed%20(3-20).mp4 HTTP/1.1" 206 546849 "http://example.com/video/302/" "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36"'

But I can't work out how to write the regular expression from the format configuration itself. Can anyone help?

1 answer:

Answer 0: (score: 6)

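To build the expression from the config, replace each configuration variable like $xxx with a named group like (?P<xxx>.*?) and escape the delimiters: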

import re

conf = '$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent"'
regex = ''.join(
    '(?P<' + g + '>.*?)' if g else re.escape(c)
    for g, c in re.findall(r'\$(\w+)|(.)', conf))

Now, if you match a log entry against this expression:

log = '112.3.194.120 - - [17/Jan/2015:20:07:34 +0800] "GET /Introdction%20to%20Guitar/1%20-%202%20-%20Choosing%20the%20Right%20Guitar-%20Right-Handed%20vs%20Left-Handed%20(3-20).mp4 HTTP/1.1" 206 546849 "http://example.com/video/302/" "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36"'
m = re.match(regex, log)

the variables are captured in the match object's groupdict():

import pprint
pprint.pprint(m.groupdict())

Result:

{'body_bytes_sent': '546849',
 'http_referer': 'http://example.com/video/302/',
 'http_user_agent': 'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36',
 'remote_addr': '112.3.194.120',
 'remote_user': '-',
 'request': 'GET /Introdction%20to%20Guitar/1%20-%202%20-%20Choosing%20the%20Right%20Guitar-%20Right-Handed%20vs%20Left-Handed%20(3-20).mp4 HTTP/1.1',
 'status': '206',
 'time_local': '17/Jan/2015:20:07:34 +0800'}
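
To apply this to a whole file, the construction above can be wrapped in a function and compiled once. The following is a minimal sketch, not part of the original answer: the names build_log_regex and parse_lines are hypothetical, and the trailing $ anchor is an added assumption so that a format ending in a variable does not capture an empty string.

```python
import re

def build_log_regex(conf):
    """Compile a log format string into a regex with one named group per $variable."""
    parts = []
    for name, literal in re.findall(r'\$(\w+)|(.)', conf):
        if name:
            parts.append('(?P<%s>.*?)' % name)   # variable -> lazy named group
        else:
            parts.append(re.escape(literal))     # delimiter -> escaped literal
    # Assumption: anchor at the end so a trailing lazy group must consume the rest.
    return re.compile(''.join(parts) + '$')

def parse_lines(lines, conf):
    """Yield a dict per line that matches the format; silently skip the rest."""
    pattern = build_log_regex(conf)
    for line in lines:
        m = pattern.match(line.rstrip('\n'))
        if m:
            yield m.groupdict()

# Shortened, made-up format and sample lines for illustration only.
conf = '$remote_addr - $remote_user [$time_local] "$request" $status'
sample = [
    '10.0.0.1 - - [17/Jan/2015:20:07:34 +0800] "GET / HTTP/1.1" 200\n',
    'not a log line\n',
]
records = list(parse_lines(sample, conf))
```

In real use, lines would come from an open file object instead of the sample list. Whether to skip or report non-matching lines is a design choice; skipping is just the simplest option here.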

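If your log format has no delimiters around its variables, you'll have to use more specific sub-patterns instead of just .*?. This can be coded elegantly in a similar way: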

# variable-specific patterns
patterns = {
    'remote_addr': r'(?:\d{1,3}\.){3}\d{1,3}',  # non-capturing, so only named groups are captured
    'body_bytes_sent': r'\d+',
    # etc
}

regex = ''.join(
    '(?P<%s>%s)' % (g, patterns.get(g, '.*?')) if g
        else re.escape(c)
    for g, c in re.findall(r'\$(\w+)|(.)', conf))
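
To see why the specific sub-patterns matter, here is a small comparison under assumed inputs. The format string, the log line, the \d{3} pattern for status, and the helper name build are all made up for illustration; the request is deliberately left unquoted so that the generic .*? groups become ambiguous.

```python
import re

# Hypothetical format with an unquoted request (spaces are the only delimiters).
conf = '$remote_addr $request $status'
log = '112.3.194.120 GET /video/302/ HTTP/1.1 206'

def build(conf, patterns):
    """Same construction as above, with per-variable patterns falling back to .*?"""
    return ''.join(
        '(?P<%s>%s)' % (g, patterns.get(g, '.*?')) if g else re.escape(c)
        for g, c in re.findall(r'\$(\w+)|(.)', conf))

# Generic groups only: each lazy group stops at the first space it can.
generic = re.match(build(conf, {}), log).groupdict()

# Specific patterns: status must be three digits, so request has to stretch.
specific = re.match(build(conf, {
    'remote_addr': r'(?:\d{1,3}\.){3}\d{1,3}',
    'status': r'\d{3}',
}), log).groupdict()

print(generic)
print(specific)
```

With only generic groups, request stops at 'GET' and status comes back empty; with the specific patterns, request covers 'GET /video/302/ HTTP/1.1' and status is '206'.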