I have to parse 30 days of access logs from our server, broken down by client IP and the host accessed, and I need the top 10 most-visited sites. The log files are about 10-20 GB in size, and running the script single-threaded takes a very long time. Initially I wrote a script that worked fine, but because the log files are so large it took many hours. I then tried to use the multiprocessing library to parallelize it, but it isn't working: the multiprocessing version appears to duplicate the work rather than split it across processes. I'm not sure what is wrong with the code. Can anyone help? Thanks in advance.
Code:
from datetime import datetime, timedelta
import os
import sys
import multiprocessing

def ipauth(slave_list, static_ip_list):
    file_record = open('/home/access/top10_domain_accessed/logs/combined_log.txt', 'a')
    count = 1
    while count <= 30:
        Nth_days = datetime.now() - timedelta(days=count)
        date = Nth_days.strftime("%Y%m%d")
        yr_month = Nth_days.strftime("%Y/%m")
        file_name = 'local2' + '.' + date
        with open(slave_list) as file:
            for line in file:
                string = line.split()
                slave = string[0]
                proxy = string[1]
                log_path = "/LOGS/%s/%s" % (slave, yr_month)
                try:
                    os.path.exists(log_path)
                    file_read = os.path.join(log_path, file_name)
                    with open(file_read) as log:
                        for log_line in log:
                            log_line = log_line.strip()
                            if proxy in log_line:
                                file_record.write(log_line + '\n')
                except IOError:
                    pass
        count = count + 1
    file_log = open('/home/access/top10_domain_accessed/logs/ipauth_logs.txt', 'a')
    with open(static_ip_list) as ip:
        for line in ip:
            with open('/home/access/top10_domain_accessed/logs/combined_log.txt', 'r') as f:
                for content in f:
                    log_split = content.split()
                    client_ip = log_split[7]
                    if client_ip in line:
                        content = str(content).strip()
                        file_log.write(content + '\n')
    return

if __name__ == '__main__':
    slave_list = sys.argv[1]
    static_ip_list = sys.argv[2]
    jobs = []
    for i in range(5):
        p = multiprocessing.Process(target=ipauth, args=(slave_list, static_ip_list))
        jobs.append(p)
        p.start()
        p.join()
Answer 0 (score: 0)

Updated after a conversation with the OP, see the comments.

My take: split the input file into smaller chunks, and use a process pool to work on those chunks:
import multiprocessing

def chunk_of_lines(fp, n):
    # read n lines from file
    # then yield
    pass

def process(lines):
    pass  # do stuff to a file

p = multiprocessing.Pool()
fp = open(slave_list)
for f in chunk_of_lines(fp, 10):
    p.apply_async(process, [f, static_ip_list])
p.close()
p.join()  # Wait for all child processes to close.
There are many ways to implement the chunk_of_lines method: you can use a simple for loop over the file's lines, or do something more advanced such as calling fp.read().
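One possible sketch of chunk_of_lines, using itertools.islice to pull fixed-size batches of lines. The names count_matches and run below are hypothetical stand-ins for the answer's process() and the surrounding wiring, not the OP's actual filtering logic; a real worker would apply the proxy/client-IP matching from the question:

```python
import itertools
import multiprocessing

def chunk_of_lines(fp, n):
    # Yield successive lists of at most n lines from an open file object.
    # islice pulls the next n lines; an empty result means end of file.
    while True:
        chunk = list(itertools.islice(fp, n))
        if not chunk:
            break
        yield chunk

def count_matches(lines, needle):
    # Placeholder worker: count the lines that contain `needle`
    # (e.g. a proxy address). Replace with real filtering as needed.
    return sum(1 for line in lines if needle in line)

def run(path, needle, chunk_size=10):
    # Fan the chunks out to a pool and combine the partial results.
    with multiprocessing.Pool() as pool, open(path) as fp:
        results = [pool.apply_async(count_matches, (chunk, needle))
                   for chunk in chunk_of_lines(fp, chunk_size)]
        return sum(r.get() for r in results)
```

Because each chunk is an independent list of lines, every pool worker gets a different slice of the input, instead of five processes all repeating the full job as in the question's code. Note that apply_async returns an AsyncResult, so the partial results have to be collected with .get() before they can be combined.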