I have a list of log files (named access.log.*) whose contents look like this:
95.11.113.x - [15/Nov/2013:18:25:17 +0100] "GET /files/myfile.rar HTTP/1.1" 200 2437305154 blah.com "http://www.blah.com/files/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)" "-"
95.11.113.x - [15/Nov/2013:18:25:19 +0100] "GET /files/myfile.rar HTTP/1.1" 200 2437305154 blah.com "http://www.blah.com/files/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)" "-"
95.11.113.x - [15/Nov/2013:18:25:21 +0100] "GET /files/myfile.rar HTTP/1.1" 200 2437305154 blah.com "http://www.blah.com/files/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)" "-"
125.111.9.x - [15/Nov/2013:20:00:00 +0100] "GET /files/azeazzae.rar HTTP/1.1" 200 2437305154 blah.com "http://www.blah.com/files/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)" "-"
132.41.100.x - [16/Nov/2013:11:15:11 +0100] "GET /files/myfile.rar HTTP/1.1" 200 2437305154 blah.com "http://www.blah.com/files/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)" "-"
132.41.100.x - [16/Nov/2013:11:15:11 +0100] "GET /files/myfile.rar HTTP/1.1" 200 2437305154 blah.com "http://www.blah.com/files/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)" "-"
132.41.100.x - [16/Nov/2013:11:15:11 +0100] "GET /files/myfile.rar HTTP/1.1" 200 2437305154 blah.com "http://www.blah.com/files/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)" "-"
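From these logs I want to extract the list of unique visitors who accessed /files/myfile.rar (a repeat visitor should be counted only once per day), i.e.: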
95.11.113.x - [15/Nov/2013] "GET /files/myfile.rar HTTP/1.1" 200 2437305154 blah.com "http://www.blah.com/files/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)" "-"
132.41.100.x - [16/Nov/2013] "GET /files/myfile.rar HTTP/1.1" 200 2437305154 blah.com "http://www.blah.com/files/" "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)" "-"
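I tried opening the file and looking for the desired string /files/myfile.rar, as shown in "Search for string in txt file Python", but I could not work out how to test for "same IP address" and for duplicates.
How should I go about this? A standard string search, line by line (as in "Search for string in txt file Python")? A regular expression?
PS: For later use, the following format would be even better (sorting by date, etc.):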
2013-11-15 - 95.11.113.x - "GET /files/myfile.rar HTTP/1.1"
2013-11-16 - 132.41.100.x - "GET /files/myfile.rar HTTP/1.1"
2013-11-17 ....
Answer 0 (score: 1)
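This should be the algorithm for your Python code:
1) Read each line from the file.
2) If the line contains the text /files/myfile.rar, then:
3) Parse the IP address from the line. You can use a regular expression, or split the line and take the part before the first " - ".
4) Save the line in a Python dict() variable, like this: visitors[ip] = line
When you are done, print the contents of visitors.
Here is sample code for steps 3) and 4).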
visitors = dict()
# every log line is expected to start in this format
line = '95.11.113.x - [15/Nov/2013]'
ip = line.split(" - ")[0]  # assumes the line contains " - "
visitors[ip] = line
# finally, once all lines have been processed
for visitor in visitors:
    print(visitors[visitor])
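Step 3 mentions a regular expression as an alternative to splitting. A minimal sketch of that option follows, assuming every line has the layout shown in the question. The names LINE_RE and parse_line are made up for illustration, and the pattern has only been checked against the sample lines above.

```python
import re

# Hypothetical pattern for the log layout in the question:
# IP, " - ", [dd/Mon/yyyy:hh:mm:ss zone], then the quoted request line.
LINE_RE = re.compile(
    r'^(?P<ip>\S+) - \[(?P<day>[^:\]]+):[^\]]*\] "(?P<method>\S+) (?P<url>\S+)'
)


def parse_line(line):
    """Return (ip, day, url) for a line in the expected layout, else None."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    return match.group('ip'), match.group('day'), match.group('url')


sample = ('95.11.113.x - [15/Nov/2013:18:25:17 +0100] '
          '"GET /files/myfile.rar HTTP/1.1" 200 2437305154 blah.com')
print(parse_line(sample))
```

Returning the day together with the IP makes it possible to key the dict on both values, which is what the "once per day" requirement in the question needs.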
Answer 1 (score: 1)
Here is a way to get the answer grouped by date, that is, the unique visitors per day who requested myfile.rar, across all files named access.log.*:
import glob
from collections import defaultdict

# Maps a date string such as '15/Nov/2013' to the set of IPs seen that day.
d = defaultdict(set)
for path in glob.glob('access.log.*'):
    with open(path) as log:
        for line in log:
            if line.strip():  # skip empty lines
                # Split only on the first ' - ' so later hyphens are left alone.
                bits = line.split(' - ', 1)
                ip = bits[0].strip()
                fields = bits[1].split()
                date = fields[0][1:][:-9]  # '[15/Nov/2013:18:25:17' -> '15/Nov/2013'
                url = fields[3]
                if url == '/files/myfile.rar':
                    d[date].add(ip)

for date, values in d.items():
    print('Total unique visits for {}: {}'.format(date, len(values)))
    for ip in values:
        print(ip)
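The loop above prints the days in whatever order the dict yields them. For the date-sorted layout requested in the question's PS, a sketch along these lines might help. It assumes a dict shaped like d above (date string to set of IPs) and that the month abbreviations parse under the default English/C locale. The function name format_report and its request parameter are hypothetical.

```python
from datetime import datetime

DAY_FORMAT = '%d/%b/%Y'  # matches keys such as '15/Nov/2013'


def format_report(visits_by_day, request='"GET /files/myfile.rar HTTP/1.1"'):
    """Return report lines ordered by date, then by IP within each day."""
    lines = []
    # Sort on the parsed date so the order is chronological, not alphabetical.
    days = sorted(visits_by_day,
                  key=lambda day: datetime.strptime(day, DAY_FORMAT))
    for day in days:
        iso_day = datetime.strptime(day, DAY_FORMAT).strftime('%Y-%m-%d')
        for ip in sorted(visits_by_day[day]):
            lines.append('{} - {} - {}'.format(iso_day, ip, request))
    return lines


example = {
    '16/Nov/2013': {'132.41.100.x'},
    '15/Nov/2013': {'95.11.113.x'},
}
for report_line in format_report(example):
    print(report_line)
```

Sorting on the parsed date rather than on the raw string matters because strings such as '02/Dec/2013' and '15/Nov/2013' would otherwise sort by day of month first.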
Answer 2 (score: 0)
The following is the result of applying SabujHassan's approach. I am posting it only for future reference.
visitors = dict()
with open('access.log.52') as fp:
    for line in fp:
        if '/files/myfile.rar' in line:
            ip = line.split(" - ")[0]  # assumes the line contains " - "
            visitors[ip] = line
for ip in visitors:
    print(visitors[ip])
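Because the dict above is keyed on the IP alone, an IP that returns on a later day overwrites its earlier entry. A possible variation, sketched here under the assumption that each line starts with the IP, then " - ", then a bracketed timestamp, keys on the (ip, day) pair instead. The function name unique_daily_visits is made up for illustration.

```python
def unique_daily_visits(lines, target='/files/myfile.rar'):
    """Keep the first matching line seen for each (ip, day) pair."""
    visitors = {}
    for line in lines:
        if target not in line:
            continue
        ip, _, rest = line.partition(' - ')
        # rest starts with '[15/Nov/2013:18:25:17 +0100]';
        # the day is the text between '[' and the first ':'.
        day = rest[1:].split(':', 1)[0]
        visitors.setdefault((ip, day), line)
    return visitors


log_lines = [
    '95.11.113.x - [15/Nov/2013:18:25:17 +0100] "GET /files/myfile.rar HTTP/1.1" 200',
    '95.11.113.x - [15/Nov/2013:18:25:19 +0100] "GET /files/myfile.rar HTTP/1.1" 200',
    '125.111.9.x - [15/Nov/2013:20:00:00 +0100] "GET /files/azeazzae.rar HTTP/1.1" 200',
    '95.11.113.x - [16/Nov/2013:09:00:00 +0100] "GET /files/myfile.rar HTTP/1.1" 200',
]
for (ip, day) in unique_daily_visits(log_lines):
    print(day, ip)
```

To use it on a real file, the open file object can be passed directly as lines, since iterating over it yields one line at a time.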