How do I find the number of times a particular IP appears in a log file?

Time: 2012-03-12 17:01:22

Tags: python ip-address python-2.x

I have a Python script that extracts the unique IP addresses from a log file and shows how many times each of those IPs was pinged. The code is below.

 import sys

 def extract_ip(line):
     return line.split()[0]

 def increase_count(ip_dict, ip_addr):
     if ip_addr in ip_dict:
        ip_dict[ip_addr] += 1
     else:
        ip_dict[ip_addr] = 1

 def read_ips(infilename):
     res_dict = {}
     log_file = file(infilename)
     for line in log_file:
         if line.isspace():
            continue
         ip_addr = extract_ip(line)
         increase_count(res_dict, ip_addr)
     return res_dict

 def write_ips(outfilename, ip_dict):
     out_file = file(outfilename, "w")
     for ip_addr, count in ip_dict.iteritems():
         out_file.write("%5d\t%s\n" % (count, ip_addr))
     out_file.close()

 def parse_cmd_line_args():
     if len(sys.argv)!=3:
         print("Usage: %s [infilename] [outfilename]" % sys.argv[0])
         sys.exit(1)
     return sys.argv[1], sys.argv[2]

 def main():
     infilename, outfilename = parse_cmd_line_args()
     ip_dict = read_ips(infilename)
     write_ips(outfilename, ip_dict)

 if __name__ == "__main__":
     main()

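I want to add a feature to the code so that, when a specific URL is passed in, it returns how many times each IP address accessed that URL.

E.g., if I pass this URL as input: http://www.epicbrowser.com/hrefadd.xml

the output should be in the following format: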

10.10.128.134        4
10.134.222.232       6

The log file is in the following format and contains 24k lines.

220.227.40.118 - - [06/Mar/2012:00:00:00 -0800] "GET /mysidebars/newtab.html HTTP/1.1" 404 0 - -
220.227.40.118 - - [06/Mar/2012:00:00:00 -0800] "GET /hrefadd.xml HTTP/1.1" 204 214 - -
59.95.13.217 - - [06/Mar/2012:00:00:00 -0800] "GET /dbupdates2.xml HTTP/1.1" 404 0 - -
111.92.9.222 - - [06/Mar/2012:00:00:00 -0800] "GET /mysidebars/newtab.html HTTP/1.1" 404 0 - -
120.56.236.46 - - [06/Mar/2012:00:00:00 -0800] "GET /hrefadd.xml HTTP/1.1" 204 214 - -
49.138.106.21 - - [06/Mar/2012:00:00:00 -0800] "GET /add.txt HTTP/1.1" 204 214 - -
117.195.185.130 - - [06/Mar/2012:00:00:00 -0800] "GET /mysidebars/newtab.html HTTP/1.1" 404 0 - -
122.160.166.220 - - [06/Mar/2012:00:00:00 -0800] "GET /mysidebars/newtab.html HTTP/1.1" 404 0 - -

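1 answer:

Answer 0 (score: 3)

First, don't reinvent the wheel; use a Counter object (collections.Counter) instead of counting by hand in a dict.

Second, use re.match() to extract the IP address. That way, lines that don't start with a parseable IP address simply fail to match, and you don't have to handle them separately.

Something like: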

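import re
from collections import Counter

cnt = Counter()
# Matches a dotted-quad IPv4 address (each octet 0-255) at the start of the line, followed by " - -".
ipre = re.compile(r'^(?P<ip>(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])) - -')
# infilename is the path to the log file, as in the question's script.
with open(infilename) as infile:
    for line in infile:
        m = ipre.match(line)
        if m is not None:
            # Lines without a leading IP never reach this point.
            ip = m.group('ip')
            cnt[ip] += 1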
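
The snippet above counts every request per IP. To count only the requests for one URL, as the question asks, the same idea can be extended to capture the request path as well. The following is a minimal sketch, not a definitive implementation. It assumes that every line follows the format shown in the question, that only the path part of the URL matters (the log does not record the host), and that the path must match exactly, query string included. The names LINE_RE, count_ips_for_url and format_counts are made up for this example, and the IP pattern is deliberately looser than the one above (it does not check that each octet is at most 255).

```python
import re
from collections import Counter

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2.x

# Hypothetical pattern for lines such as:
# 220.227.40.118 - - [06/Mar/2012:00:00:00 -0800] "GET /hrefadd.xml HTTP/1.1" 204 214 - -
# It captures the leading IP-like token and the requested path.
LINE_RE = re.compile(
    r'^(?P<ip>\d{1,3}(?:\.\d{1,3}){3}) \S+ \S+ \[[^\]]*\] '
    r'"(?P<method>\S+) (?P<path>\S+)'
)


def count_ips_for_url(lines, url):
    """Count, per IP, the log lines whose request path equals the path of `url`."""
    target_path = urlparse(url).path
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        # Lines that do not parse, or that request another path, are skipped.
        if m is not None and m.group('path') == target_path:
            counts[m.group('ip')] += 1
    return counts


def format_counts(counts):
    """Return one 'IP  count' line per IP, most frequent first."""
    return ["%-20s %d" % (ip, n) for ip, n in counts.most_common()]
```

Because `count_ips_for_url` accepts any iterable of lines, an open file object can be passed directly, for example `count_ips_for_url(open(infilename), "http://www.epicbrowser.com/hrefadd.xml")`, and the result of `format_counts` can then be printed or written to the output file line by line.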