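So I need to parse an nginx log file. Partway through the log file, a new variable was added to the end of each line.
I used https://github.com/bbb1991/nginx-log-parser/blob/master/main.py as inspiration (that is, I used most of its code).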
import re

REQUEST_TIME_CONF = '$remote_addr - $remote_user [$time_local] "$request" ' \
                    '$status $body_bytes_sent $http_referer" ' \
                    '"$http_user_agent" "$gzip_ratio" "$request_time"'


def get_requests(file_name):
    """
    """
    file_to_open = open(file_name, "r+")
    log_lines = file_to_open.readlines()
    lines = []
    log_pattern = ''.join(
        '(?P<' + g + '>.*?)' if g else re.escape(c)
        for g, c in re.findall(r'\$(\w+)|(.)', REQUEST_TIME_CONF))
    for line in log_lines:
        lines.append(find(log_pattern, line))
    return lines


def find(log_pattern, text):
    match = re.match(log_pattern, text)
    if match:
        return match
    else:
        return False


def process_log(log_file):
    requests = get_requests(log_file)
    #print(requests)
    for x in range(len(requests)):
        request = requests[x]
        request = request.groupdict()
        remote_addr = request.get('remote_addr')
        remote_user = request.get('remote_user')
        time_local = request.get('time_local')
        request_item = request.get('request')
        status = request.get('status')
        body_bytes_sent = request.get('body_bytes_sent')
        http_referer = request.get('http_referer')
        http_user_agent = request.get('http_user_agent')
        gzip_ratio = request.get('gzip_ratio')
        try:
            request_time = request.get('request_time')
        except AttributeError:
            request_time = None
        # print(remote_addr,remote_user,time_local,request_item,status,
        #       body_bytes_sent,http_referer,http_user_agent,gzip_ratio,
        #       request_time)
        print(request)


access_log_to_parse = '/Users/username/Documents/Development/sample_access.log'
process_log(access_log_to_parse)
The sample_access.log file looks like this:
10.1.0.59 - - [12/Jul/2017:17:57:56 +0600] "POST /court/ws/avf HTTP/1.1" 500 296 "-" "CodeGear SOAP 1.3" "0.01" "0.003"
10.1.0.59 - userTest [12/Jul/2017:17:57:56 +0600] "POST /court/ws/avf HTTP/1.1" 500 296 "-" "CodeGear SOAP 1.3" "0.01"
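Nginx uses a specific log format, which is declared in REQUEST_TIME_CONF. I removed request_time from the last line to simulate a log line that does not have this attribute.
So when request_time is present, its value should be written; otherwise None should be written instead.
Running the code produces the following error: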
AttributeError: 'bool' object has no attribute 'groupdict'
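I did some more research on this, and it looks like the re module returns TRUE or FALSE depending on whether there is a match. As you can see, I naively wrapped request_time in a try/except, thinking I could pass None if the value was missing, but that did not work.
So it looks like some kind of check is needed around the log_pattern regex, the findall call, or re.match, but my Python skills are pretty lacking (hence the borrowed code, haha).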
Answer 0 (score: 0)
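The re module returns a match object or None, never a boolean. Your find function, however, returns False when the match is None, and calling groupdict() on that False is what raises the AttributeError. In that case, the result should not be appended to the list.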
for line in log_lines:
    request = find(log_pattern, line)
    if request:
        lines.append(request)
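Note that with this check, a line that lacks request_time simply does not match the pattern and is skipped, so it never reaches the output with a None value. If you want such lines kept, one option is to make the trailing field optional in the regex. The following is only a minimal sketch under some assumptions: BASE_CONF, build_pattern, LOG_PATTERN and parse_line are names made up for this example, the referer is assumed to be quoted on both sides as in the sample lines, and request_time is assumed to always be the last field when present.

```python
import re

# Assumed format: the question's fields without "$request_time",
# and with the referer quoted on both sides as in the sample lines.
BASE_CONF = ('$remote_addr - $remote_user [$time_local] "$request" '
             '$status $body_bytes_sent "$http_referer" '
             '"$http_user_agent" "$gzip_ratio"')


def build_pattern(conf):
    """Turn a log format string into a regex with one named group per $variable."""
    return ''.join(
        '(?P<' + g + '>.*?)' if g else re.escape(c)
        for g, c in re.findall(r'\$(\w+)|(.)', conf))


# The trailing quoted field is wrapped in an optional non-capturing group,
# so request_time is None when the line ends after gzip_ratio.
LOG_PATTERN = re.compile(
    build_pattern(BASE_CONF) + r'(?: "(?P<request_time>[^"]*)")?\s*$')


def parse_line(line):
    """Return a dict of fields, or None if the line does not fit the format."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


if __name__ == '__main__':
    sample = [
        '10.1.0.59 - - [12/Jul/2017:17:57:56 +0600] "POST /court/ws/avf HTTP/1.1" '
        '500 296 "-" "CodeGear SOAP 1.3" "0.01" "0.003"',
        '10.1.0.59 - userTest [12/Jul/2017:17:57:56 +0600] "POST /court/ws/avf HTTP/1.1" '
        '500 296 "-" "CodeGear SOAP 1.3" "0.01"',
    ]
    for line in sample:
        fields = parse_line(line)
        if fields is not None:
            print(fields['remote_user'], fields['request_time'])
```

The pattern is anchored at the end of the line, so the optional group cannot silently swallow some other field. Lines that do not fit the format at all still come back as None and can be skipped with the same if check as above.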