Using Python

Time: 2018-07-18 09:40:50

Tags: python python-3.x parsing

Question

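I am new to Python and am trying to use it to go through a large number of large custom log files, extract parameters from certain GET requests, and derive some statistics from them.

The log files I am parsing look like this: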

80 172.23.131.149 "2018-07-05 13:08:25 860" "POST /bios/servlet/bios.servlets.sso.WaffleLoginServlet HTTP/1.1" 401 5 891 891 "-" "Java/1.8.0_171"
8080 172.23.131.251 "2018-07-05 13:08:26 594" "HEAD /bios/servlet/bios.servlets.web.Ping?level=3 HTTP/1.0" 200 - 1953 1953 "-" "-"
8080 172.23.131.252 "2018-07-05 13:08:26 594" "HEAD /bios/servlet/bios.servlets.web.Ping?level=3 HTTP/1.0" 200 - 953 953 "-" "-"
80 172.23.131.149 "2018-07-05 13:08:28 188" "GET /bios/wms/app/baggis/web/WMS_STHLM_STOCKHOLMSKARTA_HYBRID_INTERN?TILED=TRUE&SERVICE=WMS&VERSION=1.1.1&REQUEST=GetMap&FORMAT=image%2Fpng&TRANSPARENT=false&LAYERS=p_1002095&SRS=EPSG%3A3011&STYLES=&r=n2q&WIDTH=256&HEIGHT=256&BBOX=156240.234375%2C6576777.34375%2C156269.53125%2C6576806.640625 HTTP/1.1" 200 133210 3547 3516 "http://tkkarta3.stockholm.se/astolmap/v3/kopplet/tkkarta.htm" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36"
80 172.23.131.149 "2018-07-05 13:08:28 188" "GET /bios/wms/app/baggis/web/WMS_STHLM_STOCKHOLMSKARTA_HYBRID_INTERN?TILED=TRUE&SERVICE=WMS&VERSION=1.1.1&REQUEST=GetMap&FORMAT=image%2Fpng&TRANSPARENT=false&LAYERS=p_1002095&SRS=EPSG%3A3011&STYLES=&r=n2q&WIDTH=256&HEIGHT=256&BBOX=156240.234375%2C6576748.046875%2C156269.53125%2C6576777.34375 HTTP/1.1" 200 108066 3547 3532 "http://tkkarta3.stockholm.se/astolmap/v3/kopplet/tkkarta.htm" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36"
80 172.23.131.149 "2018-07-05 13:08:28 188" "POST /bios/servlet/bios.servlets.GetGeometryComponents HTTP/1.1" 401 4 2484 2484 "-" "Java/1.8.0_171"
80 172.23.131.149 "2018-07-05 13:08:28 204" "GET /bios/wms/app/baggis/web/WMS_STHLM_STOCKHOLMSKARTA_HYBRID_INTERN?TILED=TRUE&SERVICE=WMS&VERSION=1.1.1&REQUEST=GetMap&FORMAT=image%2Fpng&TRANSPARENT=false&LAYERS=p_1002095&SRS=EPSG%3A3011&STYLES=&r=n2q&WIDTH=256&HEIGHT=256&BBOX=156210.9375%2C6576806.640625%2C156240.234375%2C6576835.9375 HTTP/1.1" 200 123953 3563 3547 "http://tkkarta3.stockholm.se/astolmap/v3/kopplet/tkkarta.htm" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36"
80 172.23.131.149 "2018-07-05 13:08:28 204" "GET /bios/wms/app/baggis/web/WMS_STHLM_STOCKHOLMSKARTA_HYBRID_INTERN?TILED=TRUE&SERVICE=WMS&VERSION=1.1.1&REQUEST=GetMap&FORMAT=image%2Fpng&TRANSPARENT=false&LAYERS=p_1002095&SRS=EPSG%3A3011&STYLES=&r=n2q&WIDTH=256&HEIGHT=256&BBOX=156210.9375%2C6576777.34375%2C156240.234375%2C6576806.640625 HTTP/1.1" 200 147132 3563 3547 "http://tkkarta3.stockholm.se/astolmap/v3/kopplet/tkkarta.htm" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36"
80 172.23.131.149 "2018-07-05 13:08:28 204" "GET /bios/wms/app/baggis/web/WMS_STHLM_STOCKHOLMSKARTA_HYBRID_INTERN?TILED=TRUE&SERVICE=WMS&VERSION=1.1.1&REQUEST=GetMap&FORMAT=image%2Fpng&TRANSPARENT=false&LAYERS=p_1002095&SRS=EPSG%3A3011&STYLES=&r=n2q&WIDTH=256&HEIGHT=256&BBOX=156269.53125%2C6576777.34375%2C156298.828125%2C6576806.640625 HTTP/1.1" 200 145701 3563 3547 "http://tkkarta3.stockholm.se/astolmap/v3/kopplet/tkkarta.htm" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36"

What do I want to do?

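  1. Extract all lines whose request contains the word "GetMap" (these lines are Web Map Server requests, which are the only ones I am interested in).
  2. From those lines, use a regular expression to extract the parameter that follows "LAYERS=" or "layers=" and ends at the ampersand (&), and store it under the key "lager" (it should return, for example, "p_1002095").
  3. Count the occurrences of each "lager" value.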
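The three steps above can be sketched roughly as follows. This is only a minimal sketch: the function name `count_layers`, the pattern `LAYERS_RE`, and the shortened sample lines are illustrative assumptions and are not taken from the original script.

```python
import re
from collections import Counter

# Hypothetical pattern: capture the value after "LAYERS=" or "layers="
# up to the next "&", whitespace, or double quote.
LAYERS_RE = re.compile(r'[?&](?:LAYERS|layers)=(?P<lager>[^&\s"]*)')


def count_layers(lines):
    """Count LAYERS values on lines that contain the word GetMap."""
    counts = Counter()
    for line in lines:
        if 'GetMap' not in line:           # step 1: keep only GetMap requests
            continue
        m = LAYERS_RE.search(line)         # step 2: extract the "lager" value
        if m:
            counts[m.group('lager')] += 1  # step 3: tally occurrences
    return counts


# Shortened, made-up sample lines in the same shape as the log above.
sample = [
    '80 172.23.131.149 "2018-07-05 13:08:25 860" "POST /bios/servlet/x HTTP/1.1" 401 5',
    '80 172.23.131.149 "2018-07-05 13:08:28 188" "GET /bios/wms?SERVICE=WMS&REQUEST=GetMap&LAYERS=p_1002095&SRS=EPSG%3A3011 HTTP/1.1" 200 1',
    '80 172.23.131.149 "2018-07-05 13:08:28 204" "GET /bios/wms?SERVICE=WMS&REQUEST=GetMap&layers=p_1002095&WIDTH=256 HTTP/1.1" 200 1',
]
layer_counts = count_layers(sample)
print(layer_counts.most_common())
```

Because `count_layers` accepts any iterable of strings, an open file object can be passed to it directly, so large files are read line by line rather than loaded into memory at once.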

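I am having trouble getting number 1 above to work. I could not find anything useful (perhaps I am not searching for the right thing). The problem seems to be that the word "GetMap" sits inside a longer string. It sounds like it should be easy, but I cannot figure out how to do it.

The code I currently use for numbers 2 and 3 of the task list above is: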

#!/usr/bin/env python3

import os
import re
from collections import Counter

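# regular expression: search for LAYERS= or layers= and capture the value up to
# the next "&", whitespace, or quote. A character class such as [LAYERSlayers]
# matches only one character, so the alternation (?:LAYERS|layers) is used.
rexp = r'^.+?[?&](?:LAYERS|layers)=(?P<domain>[^&\s"]*)'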
# create counter dictionary
cnt_domains = Counter()

path = '/home/uwestephan/Logg-file-parsing/ws00524'

matched = 0
failed = 0
for filename in os.listdir(path):
    filmedsokvag = os.path.join(path, filename)  # file name including its path
    print(filmedsokvag)

    # read file / gather data; the with block closes each file when done
    with open(filmedsokvag, 'r') as f:
        for line in f:
            m = re.match(rexp, line)
            if m:
                cnt_domains.update([m.group('domain')])
                matched += 1
            else:
                failed += 1

# Output Results
print('[*] %d lines matched the regular expression' % (matched))
print('[*] %d lines failed to match the regular expression' % (failed), end='\n\n')
print('[*] ============================================')
print('[*] 100 Most Frequently Occurring of Lager Queried')
print('[*] ============================================')
for domain, count in cnt_domains.most_common(100):
    print('[*] %30s: %d' % (domain, count))
print('[*] ============================================')

# Output results to file
with open('parseroutput.txt', 'w') as fd:
    print('[*] %d lines matched the regular expression' % (matched), file=fd)
    print('[*] %d lines failed to match the regular expression' % (failed), end='\n\n', file=fd)
    print('[*] ============================================', file=fd)
    print('[*] 100 Most Frequently Occurring Lager Queried', file=fd)
    print('[*] ============================================', file=fd)
    for domain, count in cnt_domains.most_common(100):
        print('[*] %30s: %d' % (domain, count), file=fd)
    print('[*] ============================================', file=fd)

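Do you have any ideas on how to extract the GetMap requests? Thank you in advance!

1 answer:

Answer 0 (score: 0)

Check whether the line contains 'GetMap', and skip the line if it does not.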

for line in f:
    if 'GetMap' in line:  # check for 'GetMap'
        m = re.match(rexp, line)
        if m:
            cnt_domains.update([m.group('domain')])
            matched += 1
        else:
            failed += 1
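
A substring test such as `'GetMap' in line` also accepts lines where the word appears outside the request, for example in the referrer field. If that matters, one alternative is to parse the query string of the request itself. The sketch below assumes the request is a quoted field of the form "METHOD /path?query HTTP/x.y", as in the sample log; the names `REQUEST_RE` and `layers_of_getmap` are hypothetical and not part of the original code.

```python
import re
from urllib.parse import urlsplit, parse_qs

# Assumed layout: the request is a quoted field like "GET /path?query HTTP/1.1".
REQUEST_RE = re.compile(r'"[A-Z]+ (?P<url>\S+) HTTP/[\d.]+"')


def layers_of_getmap(line):
    """Return the LAYERS value of a GetMap request line, or None."""
    m = REQUEST_RE.search(line)
    if not m:
        return None
    query = urlsplit(m.group('url')).query
    # Lower-case the parameter names so LAYERS= and layers= are treated alike.
    # parse_qs maps each name to a list of values; only the first one is kept.
    params = {k.lower(): v[0] for k, v in parse_qs(query).items()}
    if params.get('request') != 'GetMap':
        return None
    return params.get('layers')


# Shortened, made-up lines in the same shape as the sample log.
getmap_line = ('80 172.23.131.149 "2018-07-05 13:08:28 188" '
               '"GET /bios/wms?SERVICE=WMS&REQUEST=GetMap&LAYERS=p_1002095&STYLES=&r=n2q HTTP/1.1" '
               '200 1 "-" "-"')
ping_line = ('8080 172.23.131.252 "2018-07-05 13:08:26 594" '
             '"HEAD /bios/servlet/bios.servlets.web.Ping?level=3 HTTP/1.0" '
             '200 - 953 953 "http://example.invalid/?REQUEST=GetMap" "-"')
print(layers_of_getmap(getmap_line))
print(layers_of_getmap(ping_line))
```

In the loop above, the returned value could be passed to `cnt_domains.update([...])` whenever it is not `None`. Note that `parse_qs` also percent-decodes values, so the counted keys may differ from the raw text in the log if a layer name contains escaped characters.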