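I'm trying to use the multiprocessing module to speed up some data processing. The idea is that I can send a chunk of data to each process, so that I use all the cores on my machine instead of only one at a time.
So I built an iterator over the data with the pandas read_fwf() function, reading chunksize = 50000 rows at a time. My problem is that the iterator should eventually raise StopIteration. I'm trying to catch it in an except block in the child process and pass a message back to the parent through a Queue, so the parent process knows it can stop spawning children. I don't know what is going wrong, but what happens is that it reaches the end of the data and then keeps spawning processes that essentially do nothing.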
def MyFunction(data_iterator, results_queue, Placeholder, message_queue):
    try:
        current_data = data_iterator.next()
        #does other stuff here
        #that isn't important
        placeholder_result = "Eggs and Spam"
        results_queue.put(placeholder_result)
        return None
    except StopIteration:
        message_queue.put("Out Of Data")
        return None
results_queue = Queue() #for passing results from each child process
message_queue = Queue() #for passing the stop iteration message
cpu_count = cpu_count() #num of cores on the machine
Data_Remaining = True #loop control
output_values = [] #list to put results in
print_num_records = 0 #used to print how many lines have been processed
my_data_file = "some_data.dat"
data_iterator = BuildDataIterator(my_data_file)
while Data_Remaining:
    processes = []
    for process_num in range(cpu_count):
        if __name__ == "__main__":
            p = Process(target=MyFunction, args=(data_iterator,results_queue,Placeholder, message_queue))
            processes.append(p)
            p.start()
            print "Process " + str(process_num) + " Started" #print some stuff to
            print_num_records = print_num_records + 50000 #show how far along
            print "Processing records through: ", print_num_records #my data file I am
    for i,p in enumerate(processes):
        print "Joining Process " + str(i)
        output_values.append(results_queue.get())
        p.join(None)
    if not message_queue.empty():
        message = message_queue.get()
    else:
        message = ""
    if message == "Out Of Data":
        Data_Remaining = False
        print "STOP ITERATION NOW PLEASE"
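For comparison, here is a minimal sketch of a different arrangement, in which only the parent process advances the iterator and the workers receive the chunks themselves. It is motivated by the fact that arguments passed to Process are copied into the child, so calling next() in a child advances only that child's copy of the iterator object. The sketch is written for Python 3, and process_chunk, iter_chunks and process_all are hypothetical names, with plain lists standing in for the pandas chunks. It illustrates the idea and is not a drop-in replacement for the code above.

```python
from multiprocessing import Pool, cpu_count


def process_chunk(chunk):
    """Hypothetical per-chunk work; stands in for the real processing."""
    return sum(chunk)


def iter_chunks(rows, size):
    """Hypothetical stand-in for the pandas chunk iterator."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def process_all(chunks, mapper=map):
    """Apply process_chunk to every chunk via mapper and collect the results.

    The loop ends when mapper exhausts chunks, so no separate
    "Out Of Data" message has to travel between processes.
    """
    return list(mapper(process_chunk, chunks))


if __name__ == "__main__":
    rows = list(range(10))
    with Pool(processes=cpu_count()) as pool:
        # pool.imap pulls chunks from the iterator in the parent process
        # and sends each chunk to a worker; results come back in order.
        print(process_all(iter_chunks(rows, 4), mapper=pool.imap))
```

Passing the mapper in as an argument keeps the batching logic testable without starting any processes: with the default map everything runs serially, and with pool.imap the same function spreads the chunks across the worker pool.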
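Update: I found that the problem is in the data iterator. My data set has roughly 8 million rows, and after those 8 million rows have been processed it never actually raises StopIteration. Instead it returns the same 14 rows of data over and over. Here is the code that builds my data iterator: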
def BuildDataIterator(my_data_file):
    #data_columns is a list of 2-tuples
    #headers is a list of strings
    #num_lines is 50000
    data_reader = read_fwf(my_data_file, colspecs=data_columns, header=None, names=headers, chunksize=num_lines)
    data_iterator = data_reader.__iter__()
    return data_iterator
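One way to check whether the iterator misbehaves on its own, independent of multiprocessing, would be to drain it in a single process and count the rows that come back. The sketch below is written for Python 3 and assumes only that each chunk supports len(), which holds for lists and for DataFrames. The names count_rows and max_rows are hypothetical, and the fake chunks are placeholder data rather than anything from the real file.

```python
def count_rows(chunks, max_rows):
    """Drain an iterator of chunks in the current process and return the row total.

    Raises RuntimeError if more than max_rows rows come back, which would
    suggest the iterator is repeating data instead of raising StopIteration.
    """
    total = 0
    for chunk in chunks:  # the for loop ends when StopIteration is raised
        total += len(chunk)  # len() works for lists and for DataFrames
        if total > max_rows:
            raise RuntimeError("iterator returned more rows than expected")
    return total


if __name__ == "__main__":
    # Hypothetical stand-in data: three chunks totalling 10 rows.
    fake_chunks = iter([[0] * 4, [0] * 4, [0] * 2])
    print(count_rows(fake_chunks, max_rows=10))
```

If a call such as count_rows(BuildDataIterator(my_data_file), max_rows) returns the expected total when run on its own, the repetition is more likely to come from how the iterator is shared with the child processes than from the reader itself.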