Question

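I want to learn multiprocessing. I have a 'tar' archive with, let's say, 1000 files (there are actually many more), each with 1000 lines. I need to read every file, line by line, and return and save some information about each file in a 'result' variable (a dictionary). I have the following code, and for some unknown reason it stops after 8 iterations: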

class DataProc():
...

def data_proc(self):
    ...
    result = {}
    read_mode = 'r'
    self.tar = tarfile.open(file_path, read_mode)
    for file in self.tar:
        q = Queue()
        p = Process(target=process_tar,
                    args=(file, q))
        p.start()
        tmp_result = q.get()
        for key, item in tmp_result.items():
            '''
            do some logic and save data to result
            '''
            pass
        p.join()
    return result

def process_tar(self, file, q):
    output = {}
    extr_file = self.tar.extractfile(file)
    content = extr_file.readlines()
    '''
    do some data processing with file content
    save result to output
    '''
    q.put(output)

dp = DataProc() 
result = dp.data_proc()

Only 8 iterations of 'for file in self.tar' are executed. What am I doing wrong?

Answer 1

I see several issues in the posted code.

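The main process opens the file but never closes it. If you have 1K files, you will run out of file descriptors. Pass the file path to the child process instead and let the child open it.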

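Moreover, you end up spawning 1K processes, which is hard for an ordinary computer to handle. You are also managing those processes yourself; using a Pool removes a lot of that complexity and simplifies your code.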

How big is the output produced by the child processes? If it is too big, that might be one of the reasons it gets stuck.
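Whether output size is the cause here is unknown, but results travel back to the parent pickled through a pipe, so smaller results are cheaper to send. Below is a minimal sketch of one way to keep them small: reduce the file content inside the worker and send back only a summary. The names `summarize` and `pickled_size`, and the fields in the summary, are hypothetical and only for illustration.

```python
import pickle


def summarize(lines):
    # Hypothetical reduction: keep only the numbers the parent needs,
    # not the lines themselves.
    return {
        "line_count": len(lines),
        "char_count": sum(len(line) for line in lines),
    }


def pickled_size(obj):
    # Approximate number of bytes that would be sent for `obj`.
    return len(pickle.dumps(obj))
```

Comparing `pickled_size(lines)` with `pickled_size(summarize(lines))` for one of your files gives a rough idea of how much data each worker would push back.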

Lastly, mixing OOP and multiprocessing is a rather error-prone practice (in other words, don't pass self to the child processes).
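One reason this is error prone: a pool sends each task to its workers by pickling it, and an object that holds an open file handle cannot be pickled. The sketch below illustrates the difference; `can_pickle` and `Holder` are hypothetical names, with `Holder` standing in for a class like `DataProc` that keeps an open handle on `self`.

```python
import pickle


def can_pickle(obj):
    # True if `obj` survives pickling, which multiprocessing relies on
    # to send tasks to pool workers.
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, AttributeError, pickle.PicklingError):
        return False


class Holder(object):
    # Hypothetical stand-in for DataProc: an object that keeps an open handle.
    def __init__(self, path):
        self.handle = open(path, "rb")
```

Plain data such as a `(archive_path, member_name)` tuple pickles without trouble, which is why passing paths and names to a module-level function is the safer design.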

Something like this would cut most of the unnecessary complexity (assuming Python 2).

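from multiprocessing import Pool

files_path = []  # ... list of paths of archive files ...

def process_archive(file_path):
    # Runs in a worker process, which opens (and closes) the file itself.
    with open(file_path) as archive:
        pass  # ... processing archive; return the result ...

if __name__ == '__main__':
    pool = Pool()  # one worker per CPU core by default

    # map() hands each path to a worker and returns the results in order.
    for result in pool.map(process_archive, files_path):
        pass  # ... enjoy the results ...

    pool.close()
    pool.join()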
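The question involves a single tar archive with many members rather than a list of archive files, so here is a rough adaptation of the same idea, offered as a sketch rather than a definitive solution. The parent only collects member names and closes the archive; each task re-opens the archive by path. The function names are hypothetical, and counting lines is a stand-in for the real per-file processing.

```python
import tarfile
from functools import partial


def list_member_names(archive_path):
    # Open the archive only long enough to collect the names of regular files,
    # so the parent holds no open handle while the workers run.
    with tarfile.open(archive_path, "r") as tar:
        return [member.name for member in tar if member.isfile()]


def count_lines(archive_path, member_name):
    # Runs in a worker: open the archive here rather than inheriting
    # a handle from the parent, and close it when done.
    with tarfile.open(archive_path, "r") as tar:
        extracted = tar.extractfile(member_name)
        # Stand-in for the real per-file processing (hypothetical).
        return member_name, sum(1 for _ in extracted)


def process_archive(archive_path, pool):
    # `pool` is anything with a Pool-style map(), e.g. multiprocessing.Pool().
    names = list_member_names(archive_path)
    worker = partial(count_lines, archive_path)
    return dict(pool.map(worker, names))
```

You would call `process_archive(your_path, Pool())` from under an `if __name__ == '__main__':` guard. Re-opening the archive for every member has a cost of its own, so for very large archives it may be worth handing each task a batch of names instead of a single one.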
