Question

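My problem is similar to the one described in Python - How to parallel consume and operate on files in a directory.

Problem: I have more than 100,000 files in a directory. In my case, process_file() takes a text file, does some processing, and dumps an XML file.

Unlike the thread above, I want to run pool map on batches of files.

Reason for running in batches: processing each file takes one minute on average, so it will take several days to get through the whole file list. But as files are processed, I want to start feeding the processed files to another program. For this I want to make sure that the first 100 files are ready, then the next 100, and so on.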

I have done the following:

Sorted the files in the directory. inputFileArr is the list of files.

Run the program in batches:

import math
from multiprocessing import Pool

# Number of batches, rounding up so the last partial batch is included
num_batches = int(math.ceil(len(inputFileArr) * 1.0 / batch_size))
for i in range(num_batches):
    start_index = i * batch_size
    end_index = (i + 1) * batch_size
    print("Batch #{0}: {1}".format(i, inputFileArr[start_index:end_index]))

    p = Pool(n_process)
    p.map(process_file, inputFileArr[start_index:end_index])
    print("Batch #{0} completed".format(i))

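The Python documentation of pool.map mentions:

It blocks until the result is ready.

I assumed this means that processing of the files in batch #(i + 1) starts only after processing of the files in batch #i has finished.

But that does not seem to be the case. When I look at the timestamps of the generated XML files, they show that the batch order is not maintained: some files of a batch are processed before files of the previous batch. To make sure, I printed the file names of each batch.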

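process_file()

This calls a Python script using subprocess.Popen().

subprocess.Popen(command)

command contains something like python script.py input_args, and that Python script in turn uses subprocess.Popen().

This is the code inside the Python script that is called by my Python code: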

        m_process = subprocess.Popen(command, stdout=subprocess.PIPE)
        while m_process.poll() is None:
            stdout = str(m_process.stdout.readline())
            if 'ERROR' in stdout:
                m_process.terminate()
                error = stdout.rstrip()
        output = str(output_file.read())

What should I do to make sure that my program processes the files in batch order?

Environment: Python 2.7

Answer 1

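Edit: the new answer is at the top, and the old answer is below it.

Waiting for the first 100 files to finish before starting the next 100 is somewhat inefficient (while the last files of a batch are still running, any idle workers could already be processing the next files).

Nevertheless, if you really want processing to continue to the next 100 files only after the first 100 are done, simply call map on one batch of 100 files at a time.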

files = sorted(...)
for i in range(0, len(files), 100):
    # map blocks until every file in this batch has been processed
    pool.map(process_file, files[i:i+100])

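Depending on how many workers you have, I would suggest increasing the batch size to more than 100, to reduce the time workers spend idle (as explained above).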
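If the goal is only to know when each consecutive block of 100 files is ready, a possible alternative is pool.imap, which yields results in input order while the workers keep picking up later files. The following is a minimal sketch, not the asker's code: process_file here is a placeholder, iter_completed_batches and the file names are hypothetical, and it assumes process_file returns only after its output file has been fully written.

```python
from multiprocessing import Pool


def process_file(path):
    # Placeholder for the real per-file work; returns the path it handled.
    return path


def iter_completed_batches(pool, files, batch_size=100):
    """Yield each batch of file names once every file in it is done."""
    batch = []
    # imap yields results in input order, while the workers keep
    # pulling later files instead of idling at batch boundaries.
    for done in pool.imap(process_file, files):
        batch.append(done)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly shorter, batch


if __name__ == "__main__":
    # Hypothetical file names, used only for illustration
    files = sorted("file_%03d.txt" % n for n in range(10))
    pool = Pool(4)
    for index, batch in enumerate(iter_completed_batches(pool, files, 4)):
        print("Batch #%d completed: %s" % (index, batch))
    pool.close()
    pool.join()
```

With this shape, the downstream program can be notified each time a batch is yielded, and no worker has to wait for the slowest file of the current batch.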

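Assuming you only want groups of 100 consecutive files, but not necessarily starting from the beginning, you can try the following.

As suggested, you could split the files into groups of 100 and then process each group in a separate worker (so the parallelization is over groups, but once a group finishes you know that 100 consecutive files have been processed).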

files = sorted(...)
# Slicing handles the last, possibly shorter, group automatically
file_groups = [files[i:i + 100] for i in range(0, len(files), 100)]

def process_batch(batch):
    group_index, group_files = batch
    for f in group_files:
        process_file(f)
    print('Group %d is done' % group_index)

pool.map(process_batch, enumerate(file_groups))

Answer 2

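I solved this by replacing subprocess.Popen(command) with subprocess.call(command).

Thanks to @Barak Itkin for the help and for pointing out the use of wait. I followed the solution (using subprocess.call) given in Python popen command. Wait until the command is finished.

I am mentioning the solution here in case other users run into a similar problem.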
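To illustrate why this fixes the ordering: subprocess.Popen returns as soon as the child process has been started, so process_file returned (and pool.map considered the file done) while the script was still running. The sketch below shows the two blocking variants. The child command is a stand-in that just exits, not the real script.py, and run_with_popen_and_wait is a hypothetical helper name.

```python
import subprocess
import sys


def process_file(path):
    # Stand-in command: a child Python that exits immediately.
    # In the real code this would be the script.py invocation.
    command = [sys.executable, "-c", "import sys; sys.exit(0)", path]
    # call() blocks until the child exits and returns its exit code,
    # so pool.map only sees the file as done when the work is finished.
    return subprocess.call(command)


def run_with_popen_and_wait(command):
    # Popen() alone returns right after starting the child ...
    child = subprocess.Popen(command)
    # ... so an explicit wait() is needed to block until it exits.
    return child.wait()


if __name__ == "__main__":
    print(process_file("example.txt"))
```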
