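I wrote two versions of a program that parses a log file and returns the number of strings matching a given regular expression. The single-threaded version returns the correct output: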
Number of Orders ('ORDER'): 1108
Number of Replacements ('REPLACE'): 742
Number of Orders and Replacements: 1850
Time to process: 5.018553
The multithreaded version returns incorrect values:
Number of Orders ('ORDER'): 1579
Number of Replacements ('REPLACE'): 1108
Number of Orders and Replacements: 2687
Time to process: 2.783091
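The timing can vary (the multithreaded version should be faster), but I cannot figure out why the order and replacement counts differ between the two versions.
Here is the multithreaded version: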
import re
import time
import sys
import threading
import Queue

class PythonLogParser:
    queue = Queue.Queue()

    class FileParseThread(threading.Thread):
        def __init__(self, parsefcn, f, startind, endind, olist):
            threading.Thread.__init__(self)
            self.parsefcn = parsefcn
            self.startind = startind
            self.endind = endind
            self.olist = olist
            self.f = f

        def run(self):
            self.parsefcn(self.f, self.startind, self.endind, self.olist)

    def __init__(self, filename):
        assert(len(filename) != 0)
        self.filename = filename
        self.start = 0
        self.end = 0

    def open_file(self):
        f = None
        try:
            f = open(self.filename)
        except IOError as e:
            print 'Unable to open file:', e.message
        return f

    def count_orders_from(self, f, starting, ending, offset_list):
        f.seek(offset_list[starting])
        order_pattern = re.compile(r'.*(IN:)(\s)*(ORDER).*(ord_type)*')
        replace_pattern = re.compile(r'.*(IN:)(\s)*(REPLACE).*(ord_type)*')
        order_count=replace_count = 0
        for line in f:
            if order_pattern.match(line) != None:
                order_count+=1 # = order_count + 1
            if replace_pattern.match(line) != None:
                replace_count+=1 # = replace_count + 1
        #return (order_count, replace_count, order_count+replace_count)
        self.queue.put((order_count, replace_count, order_count+replace_count))

    def get_file_data(self):
        offset_list = []
        offset = 0
        num_lines = 0
        f = 0
        try:
            f = open(self.filename)
            for line in f:
                num_lines += 1
                offset_list.append(offset)
                offset += len(line)
            f.close()
        finally:
            f.close()
        return (num_lines, offset_list)

    def count_orders(self):
        self.start = time.clock()
        num_lines, offset_list = self.get_file_data()
        start_t1 = 0
        end_t1 = num_lines/2
        start_t2 = end_t1 + 1
        f = open(self.filename)
        t1 = self.FileParseThread(self.count_orders_from, f, start_t1, end_t1, offset_list)
        self.count_orders_from(f, start_t2, num_lines, offset_list)
        t1.start()
        self.end = time.clock()
        tup1 = self.queue.get()
        tup2 = self.queue.get()
        order_count1, replace_count1, sum1 = tup1
        order_count2, replace_count2, sum2 = tup2
        print 'Number of Orders (\'ORDER\'): {0}\n'\
              'Number of Replacements (\'REPLACE\'): {1}\n'\
              'Number of Orders and Replacements: {2}\n'\
              'Time to process: {3}\n'.format(order_count1+order_count2, \
                                              replace_count1+replace_count2, \
                                              sum1+sum2, \
                                              self.end - self.start)
        f.close()

def test2():
    p = PythonLogParser('../../20150708.aggregate.log')
    p.count_orders()

def main():
    test2()

main()
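The idea is that, because the file is large, each thread reads half of it: t1 reads the first half and the main thread reads the second half. The main thread then adds the results of the two passes together and displays them.
My suspicion is that order_count and replace_count in count_orders_from are somehow being modified across threads instead of starting from 0 in each thread, but I am not sure, because I do not see why calling a method separately from two different threads would modify the same variables.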
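On that suspicion, a small self-contained check may help. It is only a sketch and not part of the program above: count_up and results are hypothetical names, and it is written in Python 3 syntax. It suggests that a local counter belongs to each call, so two threads running the same function do not share it.

```python
import threading

def count_up(n, results, index):
    # `total` is a local variable: every call, and therefore every thread,
    # gets its own copy that starts from 0.
    total = 0
    for _ in range(n):
        total += 1
    # Each thread writes to its own slot, so the writes do not collide.
    results[index] = total

results = [None, None]
threads = [
    threading.Thread(target=count_up, args=(1000, results, 0)),
    threading.Thread(target=count_up, args=(2500, results, 1)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait for both threads before reading the results
```

Each slot of results ends up holding only what its own thread counted, which points away from shared local variables as the cause of the inflated totals.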
Answer 0 (score: 0)
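The error occurs because, although in theory each thread parses its own section, in practice one thread starts parsing at the midpoint while the other parses the entire file, so items are counted twice. count_orders_from never uses its ending argument: the "for line in f" loop always runs to the end of the file. This is consistent with the output, where the surplus of 471 orders (1579 - 1108) and 366 replacements (1108 - 742) would be the half that was counted twice. I fixed it by adding a linecount variable to count_orders_from so that the reader can check whether it has reached the line at which it should stop: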
def count_orders_from(self, f, starting, ending, offset_list):
    f.seek(offset_list[starting])
    order_pattern = re.compile(r'.*(IN:)(\s)*(ORDER).*(ord_type)*')
    replace_pattern = re.compile(r'.*(IN:)(\s)*(REPLACE).*(ord_type)*')
    order_count=replace_count=linecount = 0
    for line in f:
        if order_pattern.match(line) != None:
            order_count+=1 # = order_count + 1
        if replace_pattern.match(line) != None:
            replace_count+=1 # = replace_count + 1
        if linecount==ending:
            break
        linecount+=1
    self.queue.put((order_count, replace_count, order_count+replace_count))
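Note that linecount starts from 0 rather than from starting, so the check only stops the reader that begins at the top of the file; the second reader still stops at end of file. A more general variant could look like the sketch below. It rests on several assumptions: it uses Python 3 syntax, the names line_offsets, count_range and count_orders are hypothetical, the patterns are simplified to a search for IN: followed by ORDER or REPLACE, each worker opens its own file handle instead of sharing f, and the file is read in binary mode so that the offsets are byte positions.

```python
import itertools
import re
import threading

# Simplified stand-ins for the original patterns (an assumption).
ORDER = re.compile(br'IN:\s*ORDER')
REPLACE = re.compile(br'IN:\s*REPLACE')

def line_offsets(path):
    """Return the byte offset at which each line starts."""
    offsets, pos = [], 0
    with open(path, 'rb') as f:
        for line in f:
            offsets.append(pos)
            pos += len(line)
    return offsets

def count_range(path, offsets, start, stop, results, slot):
    """Count matches in lines [start, stop) using a private file handle."""
    orders = replaces = 0
    if start < stop:
        with open(path, 'rb') as f:
            f.seek(offsets[start])
            # islice stops after exactly (stop - start) lines.
            for line in itertools.islice(f, stop - start):
                if ORDER.search(line):
                    orders += 1
                if REPLACE.search(line):
                    replaces += 1
    results[slot] = (orders, replaces)

def count_orders(path, workers=2):
    """Split the file into `workers` line ranges and sum the counts."""
    offsets = line_offsets(path)
    n = len(offsets)
    # Range boundaries: consecutive pairs are non-overlapping and cover 0..n.
    bounds = [n * i // workers for i in range(workers + 1)]
    results = [None] * workers
    threads = [
        threading.Thread(target=count_range,
                         args=(path, offsets, bounds[i], bounds[i + 1],
                               results, i))
        for i in range(workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for every worker before summing
    return (sum(r[0] for r in results), sum(r[1] for r in results))
```

Calling count_orders(path) returns a pair (orders, replacements). Joining every thread before reading the results also matters for timing: in the original count_orders, self.end is taken right after t1.start(), so the reported time does not include the work done by t1.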