I'm using multiprocessing to process 110 sets of 4 files (440 files in total). The files all have the same number of lines (10.4 million each), so given how my for loop is structured, no iterator should be exhausted before the others. The files are parsed into collections.Counter objects, and the Counter dicts are aggregated for further analysis.
I'm getting non-reproducible StopIterations in my code: sometimes one of the multiprocessing worker processes raises a StopIteration, sometimes two of the four do, with the same input data. What am I doing wrong? I never get a StopIteration with smaller amounts of test data.
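For context on the failure mode: any next() call past end-of-file raises StopIteration, so a bare except that just passes (as in the pseudocode below) will silently hide a file that is shorter than expected. A minimal illustration with an in-memory file:

```python
import io

# Two-line "file"; next() works twice, then the iterator is exhausted.
handle = io.StringIO("line1\nline2\n")
lines = [next(handle) for _ in range(2)]

# A further next() raises StopIteration once the file is exhausted:
try:
    next(handle)
    exhausted = False
except StopIteration:
    exhausted = True
```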
My pseudocode:
from collections import Counter, defaultdict
import multiprocessing
import pickle

def main(*args):
    # code that sets up dicts to route Counter data

def dict_populator_worker_process(*input_file_tuple_list):
    worker_dict = Counter()
    my_subproc_read_dict = defaultdict(list)
    for index_file1, index_file2, run_file1, run_file2 in input_file_tuple_list:
        index1_file_handle = open(index_file1, "rUb")
        index2_file_handle = open(index_file2, "rUb")
        run1_file_handle = open(run_file1, "rUb")
        run2_file_handle = open(run_file2, "rUb")
        for line in index1_file_handle:   # consumes the @ header of index1
            index2_file_handle.next()     # skip the @ header of index2
            index_for_read = (index1_file_handle.next().strip(), index2_file_handle.next().strip())
            worker_dict.update((index_for_read,))
            for i in range(4):
                try:
                    # THIS IS WHERE I SHOULD NOT BE exhausting the iterator
                    my_subproc_read_dict[(index_for_read, 1)].append(run1_file_handle.next())
                    my_subproc_read_dict[(index_for_read, 2)].append(run2_file_handle.next())
                except StopIteration:
                    # sometimes get this undeservedly
                    pass
            # logger.info(index_for_read)
            index1_file_handle.next()  # handles the +
            index2_file_handle.next()  # handles the +
            index1_file_handle.next()  # handles the Q
            index2_file_handle.next()  # handles the Q
    # logger.info(worker_dict.keys())
    pid = multiprocessing.current_process().pid  # current_process() alone is a Process object, not an id
    pickle.dump(worker_dict, open("counter_dict_{}.p".format(pid), "wb"))
    pickle.dump(my_subproc_read_dict, open("my_subproc_read_dict_{}.p".format(pid), "wb"))
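One way to rule out silent drift between the four handles is to read whole 4-line FASTQ records explicitly rather than interleaving individual next() calls, so a short file fails loudly at the exact record. A minimal Python 3 sketch; the `read_records` helper and the in-memory file are illustrative, not part of the original code:

```python
from itertools import islice
import io

def read_records(handle, lines_per_record=4):
    """Yield lists of `lines_per_record` lines; raise if the file ends mid-record."""
    while True:
        record = list(islice(handle, lines_per_record))
        if not record:
            return  # clean end of file, aligned on a record boundary
        if len(record) < lines_per_record:
            raise ValueError("truncated record at end of file: %r" % record)
        yield record

# In-memory stand-in for one of the index/run file handles.
fastq = "@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n"
records = list(read_records(io.StringIO(fastq)))
```

With this shape, the four handles can be consumed in lockstep via zip(read_records(h1), read_records(h2), ...), and a mismatched file count surfaces as an explicit error instead of a swallowed StopIteration.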
I'd like to know why I get a StopIteration if the iterators advance in lockstep and all the files have the same number of lines (they were produced by a Linux shell split, including equal-sized tail files). To set up the files I do this:
import glob
import math

# glob order is arbitrary; sort each list so matching chunks pair up in the tuples
all_index_file_tuples = zip(sorted(glob.glob("file1*pat.txt")),
                            sorted(glob.glob("file2*pat.txt")),
                            sorted(glob.glob("file3*pat.txt")),
                            sorted(glob.glob("file4*pat.txt")))
chunksize = int(math.ceil(len(all_index_file_tuples) / float(NUM_PROCS)))
procs = []
for i in range(NUM_PROCS):
    p = multiprocessing.Process(target=dict_populator_worker_process,
                                args=(all_index_file_tuples[chunksize * i:chunksize * (i + 1)]))
    procs.append(p)
    p.start()
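One subtlety in the Process call above: args=(some_list) is not a one-element tuple, so every 4-tuple in the slice is passed as a separate positional argument and collected by *input_file_tuple_list, whereas args=(some_list,) would pass the whole slice as a single argument. A sketch of the difference; the `worker` name is illustrative:

```python
def worker(*tuple_list):
    # Mimics the *input_file_tuple_list signature of the worker above.
    return tuple_list

slice_of_tuples = [("i1", "i2", "r1", "r2"), ("i3", "i4", "r3", "r4")]

# args=(slice_of_tuples) is just the list itself, so Process effectively calls:
spread = worker(*slice_of_tuples)   # each 4-tuple becomes one positional argument

# args=(slice_of_tuples,) would instead call:
wrapped = worker(slice_of_tuples)   # the whole list arrives as a single argument
```

With the star-args signature both spellings happen to let the for loop unpack correctly here, but the distinction matters as soon as the worker's signature changes.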