I have a .log file in the following format:
t00aws.hma.com 101.225.128.165 AnonymousUser - [30/Aug/2013:02:17:22 -0400] "PUT /v1/patients/0000341934-821?accessToken=54189273 HTTP/1.1" 204 0 0.151 0.151 0.139 - 0.000 - "Java/1.6.0_31"
t00awsp.hma.com 101.225.128.165 AnonymousUser - [30/Aug/2013:02:17:22 -0400] "PUT /v1/encounters/0-2900172?accessToken=54189273 HTTP/1.1" 204 0 0.189 10.225.128.165 - 0.000 - "Java/1.6.0_31"
t00awsp.hma.com 101.225.128.165 AnonymousUser - [30/Aug/2013:02:17:31 -0400] "PUT /v1/encounters/84 -843-5085577?accessToken=54189273 HTTP/1.1" 204 0 0.151 10.225.128.165 - 0.000 - "Java/1.6.0_31"
t00aws.hma.com 101.225.128.165 AnonymousUser - [30/Aug/2013:02:17:31 -0400] "PUT /v1/encounters/84 843-5085577?accessToken=54189273 HTTP/1.1" 204 0 0.147 0.146 0.135 - 0.000 - "Java/1.6.0_31"
t00awsp2.hma.com 102.225.128.165 AnonymousUser - [30/Aug/2013:02:17:34 -0400] "PUT /v1/encounters/000 63-1332770?accessToken=54189273 HTTP/1.1" 204 0 0.152 0.152 0.140 - 0.000 - "Java/1.6.0_31"
I wrote a method to parse this log file, and I want to use a dictionary to find how many times each IP address called a URL:
    url_dict = {
        '10.225.128.165': ['v1/ready', 4],  # 'ip': ['url', count]
        '10.225.128.162': ['/v2/fab', 2],
    }
This is my code in views.py:

    import re
    from datetime import datetime

    from django.shortcuts import render

    def get_reports_hipaa(request):
        wwwlog = lines_from_dir('*.log', '/home/arya/c/')
        log_re = re.compile('^(?P<hostname>[\w.]*) (?P<clientip>[\d.]+) (?P<user>[\w-]+) (?P<application>[\w-]+) '+\
            '(?P<request>\[\d+/\w+/\d+\:\d+\:\d+\:\d+[ \t]\-\d+\]) "(?P<method>GET|POST|PUT|DELETE|HEAD|TRACE|OPTIONS) (?P<url>.*?)'+\
            ' (?P<protocol>HTTP/1.[01])" (?P<status>\d+) (?P<bytes_sent>\d+) (?P<request_time>[\d.-]+) (?P<upstream_response_time>[\d.-]+)'+\
            ' (?P<hma_exec_time>[\d.-]+) (?P<mongo_exec_time>[\d.-]+) (?P<audit_response_time>[\d.-]+) (?P<queries_count>[\d.-]+) "(?P<user_agent>.*?)"$')
        url_list_4xx = []
        ip = {}
        count = 0
        unique_clientip = set()
        unique_url = set()
        url_dict = {}
        for line in wwwlog :
            print line
            m = log_re.match(line)
            if m :
                request1 = m.groupdict()
                resource_name = get_resource_name(request1['url'])
                time = request1["request"].split(" ")[0].split("[")[1]
                time = datetime.strptime(str(time), '%d/%b/%Y:%H:%M:%S')
                list = []
                clientip = request1["clientip"]
                if clientip not in unique_clientip :
                    ip[clientip] = 0
                if clientip in unique_clientip :
                    url = remove_access_token(request1['url'])
                    if url in unique_url :
                        list.append(url)
                        ip[clientip] += 1
                        list.append(ip[clientip])
                        url_dict[clientip] = list
                    else:
                        unique_url.add(url)
                else :
                    unique_clientip.add(request1["clientip"])
        return render(request, "hipaa_report.html", {"url_dict": url_dict})
My output is not correct. Are there any suggestions for better logic?
Answer 0 (score: 2)
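Use a tuple as the key for url_dict:

    key = (clientip, url)
    url_dict[key] += 1

and

    from collections import defaultdict

    url_dict = defaultdict(int)

so that every counter automatically starts at 0 (defaultdict takes a factory callable such as int, not the value 0 itself). The loop then becomes: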
    for line in wwwlog :
        print line
        m = log_re.match(line)
        if m :
            request1 = m.groupdict()
            clientip = request1["clientip"]
            url = remove_access_token(request1['url'])
            key = (clientip, url)
            url_dict[key] += 1
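
For anyone who wants to try the tuple-key idea outside the Django view, here is a minimal self-contained sketch. It makes several assumptions that are not in the original code: `LINE_RE` is a simplified pattern that captures only the client IP and the URL, `strip_query` is a hypothetical stand-in for `remove_access_token`, `group_by_ip` is an optional extra step that regroups the counts per IP, and the sample lines are made up for illustration.

```python
import re
from collections import defaultdict

# Simplified pattern (an assumption for this sketch): it captures only the
# client IP and the request URL, not every field matched by log_re.
LINE_RE = re.compile(
    r'^\S+ (?P<clientip>[\d.]+) .*?"(?:GET|POST|PUT|DELETE|HEAD|TRACE|OPTIONS) '
    r'(?P<url>.*?) HTTP/1\.[01]"'
)


def strip_query(url):
    """Hypothetical stand-in for remove_access_token: drop the query string."""
    return url.split('?', 1)[0]


def count_requests(lines):
    """Map each (clientip, url) pair to the number of lines that contain it."""
    counts = defaultdict(int)  # a missing key starts at int() == 0
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            key = (m.group('clientip'), strip_query(m.group('url')))
            counts[key] += 1
    return counts


def group_by_ip(counts):
    """Regroup {(ip, url): n} into a plain {ip: {url: n}} dict."""
    grouped = defaultdict(dict)
    for (ip, url), n in counts.items():
        grouped[ip][url] = n
    return dict(grouped)


# Made-up sample lines, shortened after the status and byte count.
sample = [
    'h.example.com 10.0.0.1 AnonymousUser - [30/Aug/2013:02:17:22 -0400] '
    '"PUT /v1/ready?accessToken=1 HTTP/1.1" 204 0',
    'h.example.com 10.0.0.1 AnonymousUser - [30/Aug/2013:02:17:23 -0400] '
    '"PUT /v1/ready?accessToken=2 HTTP/1.1" 204 0',
    'h.example.com 10.0.0.2 AnonymousUser - [30/Aug/2013:02:17:24 -0400] '
    '"GET /v2/fab HTTP/1.1" 200 0',
]
counts = count_requests(sample)
grouped = group_by_ip(counts)
```

With this sample, `grouped` maps `'10.0.0.1'` to `{'/v1/ready': 2}` and `'10.0.0.2'` to `{'/v2/fab': 1}`. Unlike the `'ip': ['url', count]` shape in the question, the nested dict can hold more than one URL per IP. Converting to a plain dict before handing the result to a template is a cautious choice, since a defaultdict creates entries on lookup of a missing key.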