I am analyzing an Apache log file of about 1 GB. I wrote a Python script that gets the result in roughly 18 seconds. The script is:
#!/usr/bin/python
import sys

filename = sys.argv[1]
name = {}
with open(filename, "r") as data:
    for i in data:
        av = i.split()
        if av[7] in name:
            name[av[7]] += int(av[4])
        else:
            name[av[7]] = int(av[4])

# print the result to ziyou_end
with open("ziyou_end", "w") as mm:
    for i in name:
        mm.write("%s %s\n" % (i, name[i]))
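On a tiny, made-up sample (the field positions are whatever this log format happens to use; indices 7 and 4 below are simply the ones the script reads, not standard Apache combined-log fields), the aggregation works like this:

```python
# Hypothetical log lines: field index 4 holds an integer value,
# field index 7 holds the key being aggregated. These positions
# mirror what the script reads, not any standard Apache format.
lines = [
    "a b c d 100 e f /index.html",
    "a b c d 50 e f /index.html",
    "a b c d 7 e f /about.html",
]

name = {}
for line in lines:
    av = line.split()
    if av[7] in name:
        name[av[7]] += int(av[4])
    else:
        name[av[7]] = int(av[4])

print(name)  # {'/index.html': 150, '/about.html': 7}
```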
To speed the script up, I then tried threads.
#!/usr/bin/python
import threading
import Queue
import sys
import time

all = {}

def do_work(in_queue, out_queue):
    while True:
        item = in_queue.get()
        # process
        aitem = item.split()
        if all.has_key(aitem[7]):
            all[aitem[7]] += int(aitem[4])
        else:
            all[aitem[7]] = int(aitem[4])
        #out_queue.put(all)
        in_queue.task_done()

if __name__ == "__main__":
    work = Queue.Queue()
    results = Queue.Queue()

    # read the whole file into memory
    af = open(sys.argv[1], "r")
    af_con = []
    for i in af:
        af_con.append(i)

    # start 4 worker threads
    for i in xrange(4):
        t = threading.Thread(target=do_work, args=(work, results))
        t.daemon = True
        t.start()

    # produce data
    for i in af_con:
        work.put(i)
    work.join()

    # write out the results
    result = open("result_thread", "w")
    for i in all:
        result.write(i + " " + str(all[i]) + "\n")
    sys.exit()
But it took 320 seconds to get the result. Can anyone tell me why?
Using multiprocessing it is the same; it takes a very long time to get the result.

#!/usr/bin/env python
#coding:utf-8
from multiprocessing import Pool
import os
import sys

filename = sys.argv[1]
ALL = {}

def process_line(line):
    global ALL
    av = line.split()
    i = av[7]
    if ALL.has_key(i):
        ALL[i] = ALL[i] + int(av[4])
    else:
        ALL[i] = int(av[4])

if __name__ == "__main__":
    pool = Pool(12)
    with open(filename, "r") as source_file:
        # hand lines to the workers one at a time (chunksize=1)
        results = pool.map(process_line, source_file, 1)
I don't know why.
Answer 0 (score: 3)
Your task is IO-bound (you spend more time reading the file than processing the data), so threading will not help much here.
With threads, Python still runs on only a single core, so threading only helps when you have many tasks blocking on IO (like a web server). Instead of one worker (one core) working through one pile of work, you have split the pile into four, but you still have only one worker; now that one worker also has to divide its time between the tasks and handle the synchronization bookkeeping.
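As a side point, the thread version in the question also updates the shared dict from four workers without a lock. Even a corrected version, sketched below with Python 3 module names and a sentinel value to stop the workers, will not be faster, for the reasons above:

```python
import threading
import queue

counts = {}
lock = threading.Lock()

def worker(q):
    while True:
        item = q.get()
        if item is None:          # sentinel: no more work
            q.task_done()
            break
        av = item.split()
        with lock:                # protect the read-modify-write on the shared dict
            counts[av[7]] = counts.get(av[7], 0) + int(av[4])
        q.task_done()

q = queue.Queue()
threads = [threading.Thread(target=worker, args=(q,)) for _ in range(4)]
for t in threads:
    t.start()

# hypothetical sample lines in the same shape the script expects
for line in ["a b c d 100 e f /x", "a b c d 1 e f /x"]:
    q.put(line)
for _ in threads:                 # one sentinel per worker
    q.put(None)
q.join()
for t in threads:
    t.join()

print(counts)  # {'/x': 101}
```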
Try something more like this. It reduces memory overhead (which can improve CPU performance if you were doing a lot of reallocation, or allocating and freeing many objects), and it removes the need to push values through a work queue.
#!/usr/bin/python
import sys
from collections import defaultdict

filename = sys.argv[1]
name = defaultdict(int)
with open(filename, "r") as s:
    for i in s:
        av = i.split()
        name[av[7]] += int(av[4])

# print result to ziyou_end
with open("ziyou_end", "w") as mm:
    for k, v in name.iteritems():
        mm.write("%s %s\n" % (k, v))