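I am still fairly new to Python. I am using Python 3 to go through a large number of big custom log files, extract parameters from certain GET requests, and gather some statistics from them. I have come quite far, but I ran into two problems that my colleague and I cannot figure out. I will post them as two separate questions to avoid confusion.
My log files look like this: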
80 172.23.131.149 "2018-07-05 13:08:25 860" "POST /bios/servlet/bios.servlets.sso.WaffleLoginServlet HTTP/1.1" 401 5 891 891 "-" "Java/1.8.0_171"
8080 172.23.131.251 "2018-07-05 13:08:26 594" "HEAD /bios/servlet/bios.servlets.web.Ping?level=3 HTTP/1.0" 200 - 1953 1953 "-" "-"
8080 172.23.131.252 "2018-07-05 13:08:26 594" "HEAD /bios/servlet/bios.servlets.web.Ping?level=3 HTTP/1.0" 200 - 953 953 "-" "-"
80 172.23.131.149 "2018-07-05 13:08:28 188" "GET /bios/wms/app/baggis/web/WMS_STHLM_STOCKHOLMSKARTA_HYBRID_INTERN?TILED=TRUE&SERVICE=WMS&VERSION=1.1.1&REQUEST=GetMap&FORMAT=image%2Fpng&TRANSPARENT=false&LAYERS=p_1002095&SRS=EPSG%3A3011&STYLES=&r=n2q&WIDTH=256&HEIGHT=256&BBOX=156240.234375%2C6576777.34375%2C156269.53125%2C6576806.640625 HTTP/1.1" 200 133210 3547 3516 "http://tkkarta3.stockholm.se/astolmap/v3/kopplet/tkkarta.htm" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36"
80 172.23.131.149 "2018-07-05 13:08:28 188" "GET /bios/wms/app/baggis/web/WMS_STHLM_STOCKHOLMSKARTA_HYBRID_INTERN?TILED=TRUE&SERVICE=WMS&VERSION=1.1.1&REQUEST=GetMap&FORMAT=image%2Fpng&TRANSPARENT=false&LAYERS=p_1002095&SRS=EPSG%3A3011&STYLES=&r=n2q&WIDTH=256&HEIGHT=256&BBOX=156240.234375%2C6576748.046875%2C156269.53125%2C6576777.34375 HTTP/1.1" 200 108066 3547 3532 "http://tkkarta3.stockholm.se/astolmap/v3/kopplet/tkkarta.htm" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36"
80 172.23.131.149 "2018-07-05 13:08:28 188" "POST /bios/servlet/bios.servlets.GetGeometryComponents HTTP/1.1" 401 4 2484 2484 "-" "Java/1.8.0_171"
80 172.23.131.149 "2018-07-05 13:08:28 204" "GET /bios/wms/app/baggis/web/WMS_STHLM_STOCKHOLMSKARTA_HYBRID_INTERN?TILED=TRUE&SERVICE=WMS&VERSION=1.1.1&REQUEST=GetMap&FORMAT=image%2Fpng&TRANSPARENT=false&LAYERS=p_1002095&SRS=EPSG%3A3011&STYLES=&r=n2q&WIDTH=256&HEIGHT=256&BBOX=156210.9375%2C6576806.640625%2C156240.234375%2C6576835.9375 HTTP/1.1" 200 123953 3563 3547 "http://tkkarta3.stockholm.se/astolmap/v3/kopplet/tkkarta.htm" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36"
80 172.23.131.149 "2018-07-05 13:08:28 204" "GET /bios/wms/app/baggis/web/WMS_STHLM_STOCKHOLMSKARTA_HYBRID_INTERN?TILED=TRUE&SERVICE=WMS&VERSION=1.1.1&REQUEST=GetMap&FORMAT=image%2Fpng&TRANSPARENT=false&LAYERS=p_1002095&SRS=EPSG%3A3011&STYLES=&r=n2q&WIDTH=256&HEIGHT=256&BBOX=156210.9375%2C6576777.34375%2C156240.234375%2C6576806.640625 HTTP/1.1" 200 147132 3563 3547 "http://tkkarta3.stockholm.se/astolmap/v3/kopplet/tkkarta.htm" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36"
80 172.23.131.149 "2018-07-05 13:08:28 204" "GET /bios/wms/app/baggis/web/WMS_STHLM_STOCKHOLMSKARTA_HYBRID_INTERN?TILED=TRUE&SERVICE=WMS&VERSION=1.1.1&REQUEST=GetMap&FORMAT=image%2Fpng&TRANSPARENT=false&LAYERS=p_1002095&SRS=EPSG%3A3011&STYLES=&r=n2q&WIDTH=256&HEIGHT=256&BBOX=156269.53125%2C6576777.34375%2C156298.828125%2C6576806.640625 HTTP/1.1" 200 145701 3563 3547 "http://tkkarta3.stockholm.se/astolmap/v3/kopplet/tkkarta.htm" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36"
80 172.23.137.120 "2018-07-06 10:04:32 856" "GET /bios/wms/app/baggis/web/WMS_STHLM_STOCKHOLMSKARTA_GRA?FORMAT=image%2Fpng&TILED=TRUE&SERVICE=WMS&VERSION=1.1.1&REQUEST=GetMap&STYLES=&SRS=EPSG%3A5850&BBOX=150000,6580000,151875,6581875&WIDTH=256&HEIGHT=256 HTTP/1.1" 200 58443 0 0 "https://iservice.stockholm.se/open/TyckTill/Pages/TyckTill.aspx?systemId=synpunktsportalen" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"
80 172.23.137.120 "2018-07-06 10:04:25 400" "GET /bios/dpwebmap/cust_sth/slk/tycktill/app.htmlclient.gwt.DPWebApp.nocache.js HTTP/1.1" 200 3924 0 0 "https://iservice.stockholm.se/open/TyckTill/Pages/TyckTill.aspx?systemId=synpunktsportalen" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"
What I want to do is extract the IP address from every line that contains the string REQUEST=GetMap. The regular expression I am using is:
rexp_ip = r"(?P<ip>(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}))"
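I use the key ip to count the occurrences of each IP address in the log files in my code.
I have been staring at the regular expression and changing it back and forth, but it still does not work. It does work in Regex101, though, which is very confusing.
The complete code for this task is: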
#!/usr/bin/env python3
import os
import re
from collections import Counter
# regular expression
#rexp = [r'(?P<timestamp>\d{1,2}-\w{3}-\d{4} \d{2}:\d{2}:\d{2}\.\d{3}) client (?P<client>(?:\d{1,3}\.){3}\d{1,3}).+query: (?P<domain>.+) IN (?P<qtype>[A-Z]+) \+.+\({2}(?P<server>(?:\d{1,3}\.){3}\d{1,3})\){2}'
#rexp = r"(^.+layers=(?P<domain>.*?)&)" # search for LAYERS= or layers=
rexp_layer = r"(^.+layers=(?P<domain>.*?)[&\s])" # search for the name of the requested layer (between the string 'LAYERS=' or 'layers=' and a ampersand '&' or blankspace ' ') in each line and give it the key 'domain'
rexp_port = r"(?P<port>\d{2,4} )" # search for the 2 or 4 digit value in the beginning of each line
rexp_ip = r"(?P<ip>(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}))"
rexp_date = r"(?P<datum>\d{4}\-\d{2}\-\d{2})" # search for the date in format XXXX-XX-XX and give it the key 'datum'
rexp_time = r"(?P<tid>\d{2}\:\d{2}\:\d{2})" # search for the timestamp XX:XX:XX and give it the key 'tid'
rexp_name = r"(^.+/bios/wms/app/(?P<name>.+?)\?)" # search for the name of the called WMS service (between the string '/bios/wms/app/' and the FIRST occurrence of '?') and give it the key 'name'; the "+?" makes the "+" non-greedy
rexp_coordsys = r"(^.+&\wRS=(?P<koordsys>.*?)[&\s])" # search for the coordinate system between the string '&SRS=' or '&CRS=' and a ampersand '&' and give it the key 'koordsys'
rexp_width = r"(^.+WIDTH=(?P<width>.*?)&)" # search for the width of the requested picture (are between the string 'WIDTH=' and a ampersand '&') and give it the key 'width'
rexp_height = r"(^.+HEIGHT=(?P<height>.*?)[&\s])" # search for the height of the requested picture (are between the string 'HEIGHT=' and a ampersand '&') and give it the key 'height'
# rexp_bbox = r"(((?P<bbox_xmin>-?\d+\.?\d*)%2C)((?P<bbox_ymin>-?\d+\.?\d*)%2C)((?P<bbox_xmax>-?\d+\.?\d*)%2C)((?P<bbox_ymax>-?\d+\.?\d*)[\s&]))" # DOES NOT WORK YET; THIS IS WHERE TO CONTINUE
# create counter dictionary
cnt_domains = Counter() # for counting the occurrences of a certain layer
cnt_port = Counter() # for counting the occurrences of a certain port
cnt_ip = Counter() # for counting the occurrences of an IP address
#cnt_date = Counter() # for counting the occurrences of a certain date -- I probably will not use that
cnt_name = Counter() # for counting the occurrences of a certain service
cnt_coordsys = Counter() # for counting the occurrences of a certain coordinate system
cnt_width = Counter() # for counting the occurrences of a certain requested width
cnt_height = Counter() # for counting the occurrences of a certain requested height
cnt_bbox = Counter()
# Compile regular expression for faster computing
rexp_layer_compile = re.compile(rexp_layer, re.IGNORECASE) # get the regex to look for occurrences of LAYERS or layers - seems to work
rexp_port_compile = re.compile(rexp_port)
rexp_ip_compile = re.compile(rexp_ip)
rexp_name_compile = re.compile(rexp_name, re.IGNORECASE) # No difference with re.IGNORECASE
rexp_coordsys_compile = re.compile(rexp_coordsys) # mixes in regex for layers
rexp_width_compile = re.compile(rexp_width, re.IGNORECASE)
rexp_height_compile = re.compile(rexp_height, re.IGNORECASE)
# rexp_bbox_compile = re.compile(rexp_bbox)
# Path to folder with log files
#path = '/home/uwestephan/Logg-file-parsing/ws00848'
# path = '/home/uwestephan/Logg-file-parsing/ws00524'
# path = '/home/uwestephan/Logg-file-parsing/ws00524_test'
path = '/home/uwestephan/Logg-file-parsing/ws00848_test'
# setting the line counters to zero
matchedGETMAP = 0
failedGETMAP = 0
failed = 0
failedLAYER = 0
# open file
for filename in os.listdir(path):
    filmedsokvag = (path+"/"+filename)
    print (filmedsokvag)
    # read file / gather data
    f = open(filmedsokvag, 'r')
    # exclude all lines that do not have the string 'GetMap' in it
    for line in f:
        if re.findall('GetMap',line): # check if there is a string 'GetMap' in the line in the log file
            m = re.match(rexp_layer_compile, line) # match the name of the requested layer
            p = re.match(rexp_port_compile, line) # match the port
            i = re.match(rexp_ip_compile, line) # match the IP address
            n = re.match(rexp_name_compile, line) # match the name of the WMS service that is requested
            c = re.match(rexp_coordsys_compile, line) # match the coordinate system
            w = re.match(rexp_width_compile, line) # match the width of the requested picture that the WMS service is sending
            h = re.match(rexp_height_compile, line) # match the height of the requested picture that the WMS service is sending
            # b = re.match(rexp_bbox_compile, line)
            if m:
                cnt_domains.update([m.group('domain')]) # here I try to count the occurrences of the layer names
                # matchedGETMAP += 1 # add 1 to the line counter that counts processed lines in the file (as I do not process all lines in this if statement)
            else:
                # failedGETMAP += 1
                failedLAYER += 1 # counts the number of lines with a GetMap request that do NOT have the parameter LAYER
            if p:
                cnt_port.update([p.group('port')]) # here I try to count the occurrences of the different ports
            # else:
            #     continue
            if i:
                cnt_ip.update([i.group('ip')]) # here I try to count the occurrences of the IP addresses - THAT ONE DOES NOT WORK
            # For debugging only - the regular expression for the IP address seems not to work
            else:
                print("Cannot find IP address")
            if n:
                cnt_name.update([n.group('name')]) # here I try to count the occurrences of the names of the WMS services
                matchedGETMAP += 1 # add 1 to the line counter that counts processed lines in the file (as I do not process all lines in this if statement)
            else:
                failedGETMAP += 1
            if c:
                cnt_coordsys.update([c.group('koordsys')]) # here I try to count the occurrences of the coordinate systems
            # else:
            #     continue
            if w:
                cnt_width.update([w.group('width')]) # here I try to count the occurrences of the widths of the requested pictures that the WMS service is sending
            # else:
            #     continue
            if h:
                cnt_height.update([h.group('height')]) # here I try to count the occurrences of the heights of the requested pictures that the WMS service is sending
            # else:
            #     continue
            # if b:
            #     cnt_bbox.update([b.group('bbox_xmin')]) # here I try to count the occurrences of the bbox_xmin values of the requested pictures
            # else:
            #     continue
        else:
            failed += 1 # add 1 to the counter that counts the lines that are NOT processed by the if statement above
            continue
# Remove hyphen from the cnt_domains dictionary - not really necessary -> IT DOES NOT CREATE A COUNTER DICTIONARY BUT A NORMAL DICTIONARY
# cnt_domains = {key.replace('"',''): val for key,val in cnt_domains.items()}
# Create an empty dictionary for my replace values
f100 = open('Oversattningstabell_for_lagernamn_csv.csv', 'r')
DictionaryReplaceValues = {}
for line in f100:
    x = line.split(",")
    a = x[0]
    b = x[1]
    c = len(b)-1 # Removes the \n from the end of each line by counting the length of the string b and then reassigning a shorter string back to b
    b = b[0:c] # Removes the \n from the end of each line by counting the length of the string b and then reassigning a shorter string back to b
    DictionaryReplaceValues[a]=b
print("\n\nDet här är min Replacement dictionary")
for key in DictionaryReplaceValues.keys():
    print (key, " = ", DictionaryReplaceValues[key])
# Create an empty dictionary for the translated dictionary - not really necessary
cnt_domains_newname = {}
# Replace the old dictionary with an new one using the translating dictionary DictionaryReplaceValues
cnt_domains_newname = dict((DictionaryReplaceValues.get(key, key), value) for (key, value) in cnt_domains.items())
# Make a counter out of the dictionary created above
new_counter_cnt_domains_newname = Counter(cnt_domains_newname)
# Output Results
print('[*] %d Number of GetMap request that matched the regular expression' % (matchedGETMAP))
print('[*] %d Number of GetMap request that failed to match the regular expression' % (failedGETMAP), end='\n\n')
print('[*] %d Number of other request in the log files ' % (failed), end='\n\n')
print('[*] %d Number of GetMap requests that request the Top layer of the WMS' % (failedLAYER), end='\n\n')
print('[*] ============================================')
print('[*] 100 Most Frequently Occurring Layer Queried')
print('[*] ============================================')
#for domain, count in cnt_domains_newname.most_common(100):
for domain, count in new_counter_cnt_domains_newname.most_common(100):
    print('[*] %60s: %d' % (domain, count))
print('[*] ============================================')
print('[*] 100 Most Frequently Occurring Port Queried')
print('[*] ============================================')
for port, count in cnt_port.most_common(100):
    print('[*] %60s: %d' % (port, count))
print('[*] ============================================')
print('[*] 100 Most Frequently Occurring IP-adresses Queried')
print('[*] ============================================')
for ip, count in cnt_ip.most_common(100):
    print('[*] %60s: %d' % (ip, count))
    # print(ip, count)
print('[*] ============================================')
print('[*] ============================================')
print('[*] 100 Most Frequently Occurring WMS-name Queried')
print('[*] ============================================')
for name, count in cnt_name.most_common(100):
    print('[*] %60s: %d' % (name, count))
print('[*] ============================================')
print('[*] ============================================')
print('[*] 100 Most Frequently Occurring Coordinate Systemes Queried')
print('[*] ============================================')
for koordsys, count in cnt_coordsys.most_common(100):
    print('[*] %60s: %d' % (koordsys, count))
print('[*] ============================================')
print('[*] ============================================')
print('[*] 100 Most Frequently Occurring Picture Widths Queried')
print('[*] ============================================')
for width, count in cnt_width.most_common(100):
    print('[*] %60s: %d' % (width, count))
print('[*] ============================================')
print('[*] ============================================')
print('[*] 100 Most Frequently Occurring Picture Heights Queried')
print('[*] ============================================')
for height, count in cnt_height.most_common(100):
    print('[*] %60s: %d' % (height, count))
print('[*] ============================================')
#print('[*] ============================================')
#print('[*] 100 Most Frequently Occurring BBOX_xmin Queried')
#print('[*] ============================================')
#for bbox_xmin, count in cnt_bbox.most_common(100):
# print('[*] %30s: %d' % (bbox_xmin, count))
#print('[*] ============================================')
# Output results to file
with open('parseroutput.txt', 'w') as fd:
    print('[*] %d Number of GetMap request that matched the regular expression' % (matchedGETMAP), file=fd)
    print('[*] %d Number of GetMap request that failed to match the regular expression' % (failedGETMAP), end='\n\n', file=fd)
    print('[*] %d Number of other request in the log files ' % (failed), end='\n\n', file=fd)
    print('[*] %d Number of GetMap requests that request the Top layer of the WMS' % (failedLAYER), end='\n\n', file=fd)
    print('[*] ============================================', file=fd)
    print('[*] 100 Most Frequently Occurring Layer Queried', file=fd)
    print('[*] ============================================', file=fd)
    for domain, count in new_counter_cnt_domains_newname.most_common(100):
        print('%s: %d' % (domain, count), file=fd)
    print('[*] ============================================', file=fd)
    print('[*] 100 Most Frequently Occurring Port Queried', file=fd)
    print('[*] ============================================', file=fd)
    for port, count in cnt_port.most_common(100):
        print('%s: %d' % (port, count), file=fd)
    print('[*] ============================================', file=fd)
    print('[*] 100 Most Frequently Occurring IP-adresses Queried', file=fd)
    print('[*] ============================================', file=fd)
    for ip, count in cnt_ip.most_common(100):
        print('%s: %d' % (ip, count), file=fd)
        print(ip, count)
    print('[*] ============================================', file=fd)
    print('[*] ============================================', file=fd)
    print('[*] 100 Most Frequently Occurring WMS-name Queried', file=fd)
    print('[*] ============================================', file=fd)
    for name, count in cnt_name.most_common(100):
        print('%s: %d' % (name, count), file=fd)
    print('[*] ============================================', file=fd)
    print('[*] ============================================', file=fd)
    print('[*] 100 Most Frequently Occurring Coordinate Systemes Queried', file=fd)
    print('[*] ============================================', file=fd)
    for koordsys, count in cnt_coordsys.most_common(100):
        print('%s: %d' % (koordsys, count), file=fd)
    print('[*] ============================================', file=fd)
    print('[*] ============================================', file=fd)
    print('[*] 100 Most Frequently Occurring Picture Widths Queried', file=fd)
    print('[*] ============================================', file=fd)
    for width, count in cnt_width.most_common(100):
        print('%s: %d' % (width, count), file=fd)
    print('[*] ============================================', file=fd)
    print('[*] ============================================', file=fd)
    print('[*] 100 Most Frequently Occurring Picture Heights Queried', file=fd)
    print('[*] ============================================', file=fd)
    for height, count in cnt_height.most_common(100):
        print('%s: %d' % (height, count), file=fd)
    print('[*] ============================================', file=fd)
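Do you have any idea how to get the regular expression to extract the IP address?
Answer 0 (score: 0)
The following expression captures the IP address: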
rexp_ip = r".*\s(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}).*"
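A likely reason the original pattern works in Regex101 but not in the script is that the script calls re.match, which only tries to match at the start of the line, whereas Regex101 searches the whole input. The expression above gets around this by consuming the start of the line with `.*\s`; an alternative is to keep the original pattern and call re.search. The following is a minimal sketch of that alternative, assuming two shortened sample lines modeled on the log in the question (the names `lines` and `ip_re` are illustrative, not from the original code):

```python
import re
from collections import Counter

# Two shortened sample lines in the same layout as the log in the question
# (referer and user agent replaced by "-" for brevity).
lines = [
    '80 172.23.131.149 "2018-07-05 13:08:28 188" "GET /bios/wms/app/baggis/web/WMS_STHLM_STOCKHOLMSKARTA_GRA?SERVICE=WMS&REQUEST=GetMap&WIDTH=256 HTTP/1.1" 200 133210 3547 3516 "-" "-"',
    '8080 172.23.131.251 "2018-07-05 13:08:26 594" "HEAD /bios/servlet/bios.servlets.web.Ping?level=3 HTTP/1.0" 200 - 1953 1953 "-" "-"',
]

# The pattern from the question, minus the redundant inner group.
ip_re = re.compile(r"(?P<ip>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})")

# re.match only tries position 0, where the line holds the port, so it fails.
assert ip_re.match(lines[0]) is None
# re.search scans the whole line and finds the address.
assert ip_re.search(lines[0]).group("ip") == "172.23.131.149"

# Count IP addresses on GetMap lines only.
cnt_ip = Counter()
for line in lines:
    if "REQUEST=GetMap" not in line:
        continue
    found = ip_re.search(line)
    if found:
        cnt_ip[found.group("ip")] += 1

print(cnt_ip.most_common(100))
```

The same distinction applies to the other patterns in the script that do not begin with `^.+`; whether switching them to re.search is appropriate depends on what each one is meant to capture.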
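Answer 1 (score: 0)
You can use re.findall to find the main fields of each line (IP, request time, port, request type, and so on), and then use urllib.parse to extract the other required values: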
import re
from urllib.parse import parse_qs
def parse_line(_d:str, flag = 'datum'):
    _headers = {'datum':['datum', 'tid'], 'server':['WMS_service', 'coord', 'width', 'height']}
    if flag == 'datum':
        return dict(zip(_headers[flag], re.findall(r'\d+\-\d+\-\d+|\d+:\d+:\d+', _d)))
    new_d = parse_qs(_d)
    return dict(zip(_headers[flag], [*re.findall(r'/bios/wms/app/(.*?)\?', _d), *new_d.get('SRS', new_d.get('CRS', [])), *new_d.get('WIDTH', []), *new_d.get('HEIGHT', [])]))
file_data = [i.strip('\n') for i in open('filename.txt')]
new_data = [[re.findall(r'\d+\.\d+\.\d+\.\d+|\d+', re.sub('".*?"', '', i)), re.findall('".*?"', i)] for i in file_data]
final_results = []
for a, b in new_data:
    _temp = dict(zip(['port', 'ip'], a))
    _temp1 = {**_temp, **parse_line(b[0])} if len(b) == 1 else {**_temp, **parse_line(b[0]), **parse_line(b[1], 'server')}
    final_results.append(_temp1)
for i in final_results:
    print(i)
Output:
{'port': '80', 'ip': '172.23.131.149', 'datum': '2018-07-05', 'tid': '13:08:25'}
{'port': '8080', 'ip': '172.23.131.251', 'datum': '2018-07-05', 'tid': '13:08:26'}
{'port': '8080', 'ip': '172.23.131.252', 'datum': '2018-07-05', 'tid': '13:08:26'}
{'port': '3', 'ip': '1'}
{'port': '80', 'ip': '172.23.131.149', 'datum': '2018-07-05', 'tid': '13:08:28', 'WMS_service': 'baggis/web/WMS_STHLM_STOCKHOLMSKARTA_HYBRID_INTERN', 'coord': 'EPSG:3011', 'width': '256', 'height': '256'}
{'port': '80', 'ip': '172.23.131.149', 'datum': '2018-07-05', 'tid': '13:08:28', 'WMS_service': 'baggis/web/WMS_STHLM_STOCKHOLMSKARTA_HYBRID_INTERN', 'coord': 'EPSG:3011', 'width': '256', 'height': '256'}
{'port': '80', 'ip': '172.23.131.149', 'datum': '2018-07-05', 'tid': '13:08:28'}
{'port': '80', 'ip': '172.23.131.149', 'datum': '2018-07-05', 'tid': '13:08:28', 'WMS_service': 'baggis/web/WMS_STHLM_STOCKHOLMSKARTA_HYBRID_INTERN', 'coord': 'EPSG:3011', 'width': '256', 'height': '256'}
{'port': '80', 'ip': '172.23.131.149', 'datum': '2018-07-05', 'tid': '13:08:28', 'WMS_service': 'baggis/web/WMS_STHLM_STOCKHOLMSKARTA_HYBRID_INTERN', 'coord': 'EPSG:3011', 'width': '256', 'height': '256'}
{'port': '80', 'ip': '172.23.131.149', 'datum': '2018-07-05', 'tid': '13:08:28', 'WMS_service': 'baggis/web/WMS_STHLM_STOCKHOLMSKARTA_HYBRID_INTERN', 'coord': 'EPSG:3011', 'width': '256', 'height': '256'}
{'port': '80', 'ip': '172.23.137.120', 'datum': '2018-07-06', 'tid': '10:04:32', 'WMS_service': 'baggis/web/WMS_STHLM_STOCKHOLMSKARTA_GRA', 'coord': 'EPSG:5850', 'width': '256', 'height': '256 HTTP/1.1"'}
{'port': '80', 'ip': '172.23.137.120', 'datum': '2018-07-06', 'tid': '10:04:25'}
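Note that one height value in the output above carries a trailing ` HTTP/1.1"`. This appears to happen when HEIGHT is the last query parameter, because the whole quoted request field is passed to parse_qs. A possible variation, sketched below under the assumption that every line follows the layout shown in the question, first isolates the URL from the request field and then splits it with urlsplit before calling parse_qs. The pattern `line_re` and the helper `parse_getmap` are hypothetical names introduced for illustration:

```python
import re
from urllib.parse import parse_qs, urlsplit

# One sample line from the question (referer and user agent replaced by "-" for brevity).
line = ('80 172.23.137.120 "2018-07-06 10:04:32 856" '
        '"GET /bios/wms/app/baggis/web/WMS_STHLM_STOCKHOLMSKARTA_GRA?FORMAT=image%2Fpng'
        '&TILED=TRUE&SERVICE=WMS&VERSION=1.1.1&REQUEST=GetMap&STYLES=&SRS=EPSG%3A5850'
        '&BBOX=150000,6580000,151875,6581875&WIDTH=256&HEIGHT=256 HTTP/1.1" '
        '200 58443 0 0 "-" "-"')

# Hypothetical pattern: port, IP, the quoted timestamp, then method and URL
# from the quoted request field.
line_re = re.compile(
    r'^(?P<port>\d+) (?P<ip>\d{1,3}(?:\.\d{1,3}){3}) "[^"]*" "(?P<method>[A-Z]+) (?P<url>\S+) '
)

def parse_getmap(line):
    """Return a dict of fields for one log line, or None if the layout differs."""
    m = line_re.match(line)
    if m is None:
        return None
    parts = urlsplit(m.group("url"))
    # Upper-case the keys so that 'layers=' and 'LAYERS=' are treated alike;
    # parse_qs also decodes percent-escapes such as %3A.
    query = {k.upper(): v[0] for k, v in parse_qs(parts.query).items()}
    return {
        "port": m.group("port"),
        "ip": m.group("ip"),
        "name": parts.path.replace("/bios/wms/app/", "", 1),
        "coord": query.get("SRS", query.get("CRS")),
        "width": query.get("WIDTH"),
        "height": query.get("HEIGHT"),
    }

record = parse_getmap(line)
print(record)
```

Because the URL is cut at the first whitespace, the protocol suffix never reaches parse_qs, so the parameter values stay clean regardless of their order in the query string.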