I'm trying to parse an Apache log file, and I'd like to use Python to extract the AD common name and the file name from access.log.
My access.log file looks like this:
[01/Jan/1901:12:00:01] 12.34.56.78 TLS Protocol EncryptionMethod "GET/.../filename.zip HTTP/1.1" "CN=Smith John A,......"
What I want to extract is the following format: Smith John A, filename.zip
I've tried a few custom Python Apache log parsers from GitHub without any luck.
Any ideas on how to achieve this?
Thanks.
Answer 0 (score: 1)
Really basic.
import re

with open('access.log') as log:
    for line in log:
        # The request and the certificate subject are the two quoted fields.
        results = [m.group() for m in re.finditer(r'"([^"]*)"', line)]
        if len(results) == 2:
            print(results)
        else:
            print(line)
            print("**** can't parse")
            continue

        count = 0

        # File name: the last path component before the space that precedes HTTP/x.y.
        # [^/\s]+ also accepts digits and hyphens, which [a-z._]+ would cut off.
        m = re.search(r'GET/.*?([^/\s]+) ', line, re.I)
        if m:
            filename = m.group(1)
            count += 1
        else:
            filename = ''

        # Common name: everything between "CN=" and the next comma.
        m = re.search(r'CN=([^,]+),', line, re.I)
        if m:
            name = m.group(1)
            count += 1
        else:
            name = ''

        print(name, filename)
        if count != 2:
            print("***can't parse filename or name")
Untested!
Result for that one-line file:
['"GET/.../filename.zip HTTP/1.1"', '"CN=Smith John A,......"']
Smith John A filename.zip
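If you want the exact "Smith John A, filename.zip" format from the question, the same idea can be wrapped in a small function. This is only a sketch: the function name extract_name_and_file and the two compiled patterns are my own naming, and it assumes every line has the request and the CN in quoted fields laid out like the sample line above.

```python
import re

# Assumption: the request is a quoted field of the form "GET<path> HTTP/x.y"
# and the certificate subject contains "CN=<name>," as in the sample line.
REQUEST_RE = re.compile(r'"GET\s*(\S+)\s+HTTP/[\d.]+"')
CN_RE = re.compile(r'CN=([^,"]+)')


def extract_name_and_file(line):
    """Return (common_name, filename), or None if either part is missing."""
    request = REQUEST_RE.search(line)
    cn = CN_RE.search(line)
    if not (request and cn):
        return None
    # Drop any query string, then keep only the last path component.
    path = request.group(1).split('?', 1)[0]
    filename = path.rsplit('/', 1)[-1]
    return cn.group(1).strip(), filename


# The sample line from the question, elisions included.
sample = ('[01/Jan/1901:12:00:01] 12.34.56.78 TLS Protocol EncryptionMethod '
          '"GET/.../filename.zip HTTP/1.1" "CN=Smith John A,......"')

result = extract_name_and_file(sample)
if result:
    print(', '.join(result))
```

Returning None for lines that don't fit lets the caller decide whether to skip or log them, instead of printing inside the parser.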