I'm using multiprocessing to process 110 sets of 4 files (440 files in total). The files all have the same number of lines (10.4 million each), so given how my for loop is structured, no iterator should be exhausted before the others. The files are parsed into collections.Counter objects, and the Counter dicts are aggregated for further analysis.
I'm getting non-reproducible StopIterations in my code: sometimes one of the multiprocessing worker processes raises a StopIteration, sometimes two of the four do, with the same input data. What am I doing wrong? I never get a StopIteration with smaller amounts of test data.
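For context on the failure mode: any next() call past end-of-file raises StopIteration, so a bare except that just passes (as in the pseudocode below) will silently hide a file that is shorter than expected. A minimal illustration with an in-memory file:

```python
import io

# Two-line "file"; next() works twice, then the iterator is exhausted.
handle = io.StringIO("line1\nline2\n")
lines = [next(handle) for _ in range(2)]

# A further next() raises StopIteration once the file is exhausted:
try:
    next(handle)
    exhausted = False
except StopIteration:
    exhausted = True
```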
My pseudocode:
from collections import Counter, defaultdict
import multiprocessing
import pickle

def main(*args):
    # code that sets up dicts to route Counter data

def dict_populator_worker_process(*input_file_tuple_list):
    worker_dict = Counter()
    my_subproc_read_dict = defaultdict(list)
    for index_file1, index_file2, run_file1, run_file2 in input_file_tuple_list:
        index1_file_handle = open(index_file1, "rUb")
        index2_file_handle = open(index_file2, "rUb")
        run1_file_handle = open(run_file1, "rUb")
        run2_file_handle = open(run_file2, "rUb")
        for line in index1_file_handle:   # consumes the @ header of index1
            index2_file_handle.next()     # skip the @ header of index2
            index_for_read = (index1_file_handle.next().strip(), index2_file_handle.next().strip())
            worker_dict.update((index_for_read,))
            for i in range(4):
                try:
                    # THIS IS WHERE I SHOULD NOT BE exhausting the iterator
                    my_subproc_read_dict[(index_for_read, 1)].append(run1_file_handle.next())
                    my_subproc_read_dict[(index_for_read, 2)].append(run2_file_handle.next())
                except StopIteration:
                    # sometimes get this undeservedly
                    pass
            # logger.info(index_for_read)
            index1_file_handle.next()  # handles the +
            index2_file_handle.next()  # handles the +
            index1_file_handle.next()  # handles the Q
            index2_file_handle.next()  # handles the Q
    # logger.info(worker_dict.keys())
    pid = multiprocessing.current_process().pid  # current_process() alone is a Process object, not an id
    pickle.dump(worker_dict, open("counter_dict_{}.p".format(pid), "wb"))
    pickle.dump(my_subproc_read_dict, open("my_subproc_read_dict_{}.p".format(pid), "wb"))
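One way to rule out silent drift between the four handles is to read whole 4-line FASTQ records explicitly rather than interleaving individual next() calls, so a short file fails loudly at the exact record. A minimal Python 3 sketch; the `read_records` helper and the in-memory file are illustrative, not part of the original code:

```python
from itertools import islice
import io

def read_records(handle, lines_per_record=4):
    """Yield lists of `lines_per_record` lines; raise if the file ends mid-record."""
    while True:
        record = list(islice(handle, lines_per_record))
        if not record:
            return  # clean end of file, aligned on a record boundary
        if len(record) < lines_per_record:
            raise ValueError("truncated record at end of file: %r" % record)
        yield record

# In-memory stand-in for one of the index/run file handles.
fastq = "@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n"
records = list(read_records(io.StringIO(fastq)))
```

With this shape, the four handles can be consumed in lockstep via zip(read_records(h1), read_records(h2), ...), and a mismatched file count surfaces as an explicit error instead of a swallowed StopIteration.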
I'd like to know why I get a StopIteration if the iterators advance in lockstep and all the files have the same number of lines (they were produced by a Linux shell split, including equal-sized tail files). To set up the files I do this:
import glob
import math

# glob order is arbitrary; sort each list so matching chunks pair up in the tuples
all_index_file_tuples = zip(sorted(glob.glob("file1*pat.txt")),
                            sorted(glob.glob("file2*pat.txt")),
                            sorted(glob.glob("file3*pat.txt")),
                            sorted(glob.glob("file4*pat.txt")))
chunksize = int(math.ceil(len(all_index_file_tuples) / float(NUM_PROCS)))
procs = []
for i in range(NUM_PROCS):
    p = multiprocessing.Process(target=dict_populator_worker_process,
                                args=(all_index_file_tuples[chunksize * i:chunksize * (i + 1)]))
    procs.append(p)
    p.start()
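One subtlety in the Process call above: args=(some_list) is not a one-element tuple, so every 4-tuple in the slice is passed as a separate positional argument and collected by *input_file_tuple_list, whereas args=(some_list,) would pass the whole slice as a single argument. A sketch of the difference; the `worker` name is illustrative:

```python
def worker(*tuple_list):
    # Mimics the *input_file_tuple_list signature of the worker above.
    return tuple_list

slice_of_tuples = [("i1", "i2", "r1", "r2"), ("i3", "i4", "r3", "r4")]

# args=(slice_of_tuples) is just the list itself, so Process effectively calls:
spread = worker(*slice_of_tuples)   # each 4-tuple becomes one positional argument

# args=(slice_of_tuples,) would instead call:
wrapped = worker(slice_of_tuples)   # the whole list arrives as a single argument
```

With the star-args signature both spellings happen to let the for loop unpack correctly here, but the distinction matters as soon as the worker's signature changes.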