Append if it does not exist; if it exists, increment the count

Asked: 2015-11-11 12:51:35

Tags: python regex append

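I'm new to Python (and to programming in general) and could really use your help.

I'm trying to read through a firewall log file. I'm interested in all lines that contain Deny. When one is found, the script should extract the source IP, destination IP, destination port, and protocol. I don't want to see every line, though, only the unique ones. So far so good: everything works (although I'm sure it could have been done more cleverly). I would also like to add a counter so that I can see how many times each particular combination of s_ip, d_ip, d_port and protocol occurred, but I can't figure out how.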

Sample log file:

Nov  9 00:36:10 firewall %ASA-4-106023: Deny tcp src outside:1.1.1.1/43882 dst outside:2.2.2.2/23 by access-group "outside-in" [0x0, 0x0]
Nov  9 00:36:10 firewall %ASA-4-106023: Deny tcp src outside:1.1.1.1/38780 dst outside:2.2.2.2/23 by access-group "outside-in" [0x0, 0x0]
Nov  9 00:36:11 firewall %ASA-4-106023: Deny tcp src outside:1.1.1.1/8273 dst outside:2.2.2.2/23 by access-group "outside-in" [0x0, 0x0]
Nov  9 00:36:12 firewall %ASA-4-106023: Deny tcp src outside:1.1.1.1/23433 dst outside:2.2.2.22/23 by access-group "outside-in" [0x0, 0x0]
Nov  9 00:36:12 firewall %ASA-4-106023: Deny tcp src outside:1.1.1.1/25175 dst outside:2.2.2.24/23 by access-group "outside-in" [0x0, 0x0]
Nov  9 00:36:12 firewall %ASA-4-106023: Deny tcp src outside:1.1.1.1/15855 dst outside:2.2.2.26/23 by access-group "outside-in" [0x0, 0x0]
Nov  9 00:36:12 firewall %ASA-4-106023: Deny tcp src outside:1.1.1.1/24574 dst outside:2.2.2.27/23 by access-group "outside-in" [0x0, 0x0]
Nov  9 00:36:12 firewall %ASA-4-106023: Deny tcp src outside:1.1.1.1/21797 dst outside:2.2.2.29/23 by access-group "outside-in" [0x0, 0x0]
Nov  9 00:36:12 firewall %ASA-4-106023: Deny udp src outside:3.3.3.3/12112 dst outside:2.2.2.99/53031 by access-group "outside-in" [0x0, 0x0]
Nov  9 00:36:13 firewall %ASA-4-106023: Deny icmp src outside:4.4.4.4 dst services:2.2.2.211 (type 11, code 1) by access-group "outside-in" [0x0, 0x0]
Nov  9 00:36:17 firewall %ASA-4-106023: Deny icmp src outside:4.4.4.4 dst services:2.2.2.10 (type 3, code 3) by access-group "outside-in" [0x0, 0x0]

I am able to get the following result:

'icmp'
'tcp', '1.1.1.1', '2.2.2.2', '23'
'tcp', '1.1.1.1', '2.2.2.22', '23'
'tcp', '1.1.1.1', '2.2.2.24', '23'
'tcp', '1.1.1.1', '2.2.2.26', '23'
'tcp', '1.1.1.1', '2.2.2.27', '23'
'tcp', '1.1.1.1', '2.2.2.29', '23'
'udp', '3.3.3.3', '2.2.2.99', '53031'

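I haven't quite managed to get the icmp output right (icmp lines have no /port, which my regex relies on to capture the IP addresses), and I'll try to make the output a bit nicer (removing the ' and , characters). What I really want, though, is a hit count for each line; for example, the first tcp line would have a hit count of 3, and so on.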

import re       #for regular expressions - to match ip's
import sys      #for parsing command line opts

# if file is specified on command line, parse, else ask for file
if sys.argv[1:]:
    print "File: %s" % (sys.argv[1])
    logfile = sys.argv[1]
else:
    logfile = raw_input("Please enter a file to parse, e.g /var/log/secure: ")

match = []
seen = []

# find all Deny lines and append them in a list
for lines in open(logfile) :
    extract = re.findall('Deny.*"' ,lines)
    for i in extract :
        match.append(i)

# extract different keywords from Deny lines
for lines in match :
    prot = re.findall('Deny\s(.+?)\ssrc',lines)
    ip_src = re.findall('src.*?:([0-9a-f].*?)/', lines)
    ip_dst = re.findall('dst.*?:([0-9a-f].*?)/', lines)
    #ip_sport = re.findall('src.*?[0-9a-f].*?/([0-9].*?)\s', lines)     # uncomment if you want source port also, and add ip_sport to summarized below
    ip_dport = re.findall('dst.*?[0-9a-f].*?/([0-9].*?)\s', lines)

    summarized = prot + ip_src + ip_dst + ip_dport

    if summarized not in seen :             # only add unique entries
        seen.append(summarized)


# sort 
seen.sort()

for lines in seen :
    print ( ", ".join( repr(e) for e in lines ) )

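Also, I tried running it on a 3 GB log file, and it has now been running for several hours. Any good ideas for optimizing the code?

I realize I'm asking a lot of questions and I appreciate any help, but my main question is how to get the counter working.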

2 answers:

Answer 0 (score: 2)

The Python standard library already has a Counter class.

You can change the seen variable to a Counter:

from collections import Counter

[...]

seen = Counter()

# extract different keywords from Deny lines
for lines in match :

    [...]

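    # findall returns lists, and a list is not hashable, so build a tuple
    summarized = tuple(prot + ip_src + ip_dst + ip_dport)

    # NOTE: summarized must be hashable (a string or tuple) to serve as a Counter key.
    seen.update([summarized])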

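In the end, seen (a Counter is a dict subclass) has each unique summarized line as a key, and the number of times that line occurred as the value.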
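As a rough illustration of what the Counter ends up holding, here is a minimal sketch in Python 3 (the question's script is Python 2). The rows list is hypothetical, hand-written data standing in for fields that have already been extracted from the log; it is not output of the original script.

```python
from collections import Counter

# Hypothetical, already-extracted tuples: (protocol, src_ip, dst_ip, dst_port)
rows = [
    ("tcp", "1.1.1.1", "2.2.2.2", "23"),
    ("tcp", "1.1.1.1", "2.2.2.2", "23"),
    ("tcp", "1.1.1.1", "2.2.2.2", "23"),
    ("tcp", "1.1.1.1", "2.2.2.22", "23"),
    ("udp", "3.3.3.3", "2.2.2.99", "53031"),
]

seen = Counter()
for row in rows:
    seen[row] += 1  # same effect as seen.update([row])

# One report line per unique combination, hit count first, sorted by key
report = ["%d, %s" % (count, ", ".join(key)) for key, count in sorted(seen.items())]
for line in report:
    print(line)
```

If the most frequent combinations matter more than alphabetical order, seen.most_common() returns the (key, count) pairs ordered by count instead.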

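As for optimization, it would be better (I think) to process each line inside the for lines in open(logfile) loop itself, rather than collecting all the Deny lines in a list first.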
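A single-pass version might look like the sketch below (Python 3). The combined pattern DENY_RE, the function name count_denies, and the optional /port groups are assumptions introduced here, not part of the original answer; the optional port is one possible way to let the icmp lines match as well. The sample lines are copied from the log excerpt in the question.

```python
import re
from collections import Counter

# One compiled pattern instead of several findall calls per line.
# The "/port" parts are optional so that icmp lines (no port) also match.
DENY_RE = re.compile(
    r"Deny\s+(?P<prot>\S+)\s+"
    r"src\s+[^\s:]+:(?P<src>[^\s/]+)(?:/(?P<sport>\d+))?\s+"
    r"dst\s+[^\s:]+:(?P<dst>[^\s/]+)(?:/(?P<dport>\d+))?"
)

def count_denies(lines):
    """Count (protocol, src_ip, dst_ip, dst_port) combinations in one pass."""
    hits = Counter()
    for line in lines:
        if "Deny" not in line:  # cheap substring test before running the regex
            continue
        m = DENY_RE.search(line)
        if m:
            # dport is None for icmp; store "" so all keys stay sortable
            key = (m.group("prot"), m.group("src"), m.group("dst"), m.group("dport") or "")
            hits[key] += 1
    return hits

sample = [
    'Nov  9 00:36:10 firewall %ASA-4-106023: Deny tcp src outside:1.1.1.1/43882 dst outside:2.2.2.2/23 by access-group "outside-in" [0x0, 0x0]',
    'Nov  9 00:36:10 firewall %ASA-4-106023: Deny tcp src outside:1.1.1.1/38780 dst outside:2.2.2.2/23 by access-group "outside-in" [0x0, 0x0]',
    'Nov  9 00:36:11 firewall %ASA-4-106023: Deny tcp src outside:1.1.1.1/8273 dst outside:2.2.2.2/23 by access-group "outside-in" [0x0, 0x0]',
    'Nov  9 00:36:12 firewall %ASA-4-106023: Deny udp src outside:3.3.3.3/12112 dst outside:2.2.2.99/53031 by access-group "outside-in" [0x0, 0x0]',
    'Nov  9 00:36:13 firewall %ASA-4-106023: Deny icmp src outside:4.4.4.4 dst services:2.2.2.211 (type 11, code 1) by access-group "outside-in" [0x0, 0x0]',
]

hits = count_denies(sample)
for key, count in sorted(hits.items()):
    print(count, *key)
```

For a real log, passing the open file object (for example, with open(logfile) as fh: hits = count_denies(fh)) reads one line at a time, so the 3 GB file never has to sit in memory, and the dictionary lookup replaces the slow "summarized not in seen" scan over a list.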

Answer 1 (score: 0)

To avoid duplicate entries, you can use a set instead of a list. I would do:

seen = set()
for lines in open(logfile) :
    extract = re.findall('Deny.*"' ,lines)
    for i in extract :
        prot = re.findall('Deny\s(.+?)\ssrc',i)
        ip_src = re.findall('src.*?:([0-9a-f].*?)/', i)
        ip_dst = re.findall('dst.*?:([0-9a-f].*?)/', i)
        #ip_sport = re.findall('src.*?[0-9a-f].*?/([0-9].*?)\s', i)
        ip_dport = re.findall('dst.*?[0-9a-f].*?/([0-9].*?)\s', i)
        # findall returns lists, which are unhashable; join them into one tuple
        seen.add(tuple(prot + ip_src + ip_dst + ip_dport))  # add ip_sport here if you want

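This should be faster because it uses fewer loops. On the other hand, a set is unordered (here is a recipe for building an ordered set: http://code.activestate.com/recipes/576694/). If you don't want to build one but still need ordered output, convert the set to a list and sort it before printing.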
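For the last point, a small Python 3 sketch: sorted() accepts any iterable and returns a new sorted list, so the set does not need to be converted separately. The entries below are hypothetical, hand-written tuples in the same (protocol, src_ip, dst_ip, dst_port) layout.

```python
# Hypothetical set of unique (protocol, src_ip, dst_ip, dst_port) tuples
seen = {
    ("udp", "3.3.3.3", "2.2.2.99", "53031"),
    ("tcp", "1.1.1.1", "2.2.2.22", "23"),
    ("tcp", "1.1.1.1", "2.2.2.2", "23"),
    ("tcp", "1.1.1.1", "2.2.2.2", "23"),  # duplicate, silently dropped by the set
}

ordered = sorted(seen)  # returns a sorted list; the set itself is unchanged
for entry in ordered:
    print(", ".join(entry))
```

Note that this is a plain string sort, so "2.2.2.22" sorts before "2.2.2.3"; ordering by numeric IP value would need a custom key.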