Setting up a manager in Python in a cluster environment

Asked: 2014-08-07 11:41:34

Tags: python multithreading multiprocessing cluster-computing python-2.6

I'd appreciate your input on the following:

I recently started at a new company that runs its processes on a cluster. An existing pipeline is already in place, roughly as follows:

  1. Large files (roughly 200 of them, each around 130 GB) are stored on a NAS.
  2. Because of a disk quota on the cluster, and because the copies are very IO-intensive, I have to limit myself to copying one file at a time.
  3. A manager program in Java creates a pull script that fetches a large file over the network (from the NAS to the cluster).
  4. After the pull, the analysis pipeline runs on the cluster (a black-box process to me).
  5. Next, an "am I done" script checks whether the run on the cluster has finished. If not, the script sleeps for 10 minutes and checks again; if it has finished, the large file is deleted (the black-box script was given to me).
  6. So far I have implemented a very simple manager program in Python that does exactly this: once a file is done, it starts copying the next file from the work list and repeats.

    My problem is that I would like to extend this program so that it copies 5 large files at a time (possibly more later), submits them to the cluster, and deletes each one and starts the next only after its run has finished.

    While looking for a solution, I saw people mention multithreading or multiprocessing, in particular worker pools. I have no experience with these yet (but one can always learn, right?), and I think they would be a viable option here. My question is: how do I set up a pool of 5 workers so that each worker performs the sequence of tasks and, once finished, takes a new large file from the queue and iterates?
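The per-file sequence described in the steps above (pull, submit, poll the "am I done" check, delete) could be wrapped in a single function and handed to a pool of 5 workers. This is only a sketch: all the helper names (`copy_to_cluster`, `submit_pipeline`, `pipeline_finished`, `delete_big_file`) are hypothetical stand-ins for the existing pull, black-box, and cleanup scripts.

```python
import multiprocessing
import time

# The four helpers below are placeholders for the scripts described in
# the question; each would shell out to the real pull / black-box /
# cleanup tooling in practice.
def copy_to_cluster(big_file):
    pass  # would run the pull script (NAS -> cluster)

def submit_pipeline(big_file):
    pass  # would start the black-box analysis run

def pipeline_finished(big_file):
    return True  # would run the "am I done" check; stubbed to finish at once

def delete_big_file(big_file):
    pass  # would delete the big file from the cluster

def handle_big_file(big_file):
    # One worker runs this whole sequence for one file, start to finish.
    copy_to_cluster(big_file)
    submit_pipeline(big_file)
    while not pipeline_finished(big_file):
        time.sleep(600)  # the question's 10-minute recheck interval
    delete_big_file(big_file)
    return "finished {0}".format(big_file)

if __name__ == "__main__":
    pool = multiprocessing.Pool(5)  # at most 5 files in flight at once
    files = ["bf{0}".format(i) for i in range(8)]  # dummy work list
    print(pool.map(handle_big_file, files))
    pool.close()
    pool.join()
```

Because each call to `handle_big_file` owns exactly one file from copy to deletion, the pool size alone caps how many files exist on the cluster at any moment, which keeps the disk quota constraint intact.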

1 Answer:

Answer 0: (score: 1)

multiprocessing.Pool is designed for exactly this use case:

import multiprocessing

def process_big_file(big_file):
    print("Process {0}: Got big file {1}".format(multiprocessing.current_process(), big_file))
    return "done with {0}".format(big_file)

def get_big_file_list():
    return ['bf{0}'.format(i) for i in range(20)]  # Just a dummy list


if __name__ == "__main__":
    pool = multiprocessing.Pool(5)  # 5 worker processes in the pool
    big_file_list = get_big_file_list()
    results = pool.map(process_big_file, big_file_list)
    print(results)

Output:

Process <Process(PoolWorker-1, started daemon)>: Got big file bf0
Process <Process(PoolWorker-1, started daemon)>: Got big file bf1
Process <Process(PoolWorker-3, started daemon)>: Got big file bf2
Process <Process(PoolWorker-4, started daemon)>: Got big file bf3
Process <Process(PoolWorker-5, started daemon)>: Got big file bf4
Process <Process(PoolWorker-5, started daemon)>: Got big file bf5
Process <Process(PoolWorker-5, started daemon)>: Got big file bf6
Process <Process(PoolWorker-3, started daemon)>: Got big file bf7
Process <Process(PoolWorker-3, started daemon)>: Got big file bf8
Process <Process(PoolWorker-2, started daemon)>: Got big file bf9
Process <Process(PoolWorker-2, started daemon)>: Got big file bf10
Process <Process(PoolWorker-2, started daemon)>: Got big file bf11
Process <Process(PoolWorker-4, started daemon)>: Got big file bf12
Process <Process(PoolWorker-4, started daemon)>: Got big file bf13
Process <Process(PoolWorker-4, started daemon)>: Got big file bf14
Process <Process(PoolWorker-4, started daemon)>: Got big file bf15
Process <Process(PoolWorker-4, started daemon)>: Got big file bf16
Process <Process(PoolWorker-4, started daemon)>: Got big file bf17
Process <Process(PoolWorker-4, started daemon)>: Got big file bf18
Process <Process(PoolWorker-4, started daemon)>: Got big file bf19

['done with bf0', 'done with bf1', 'done with bf2', 'done with bf3', 'done with bf4', 'done with bf5', 'done with bf6', 'done with bf7', 'done with bf8', 'done with bf9', 'done with bf10', 'done with bf11', 'done with bf12', 'done with bf13', 'done with bf14', 'done with bf15', 'done with bf16', 'done with bf17', 'done with bf18', 'done with bf19']

The pool.map call distributes all the items in big_file_list to the workers via an internal queue. As soon as a worker finishes a task, it simply pulls the next item off the queue and keeps going until the queue is empty.
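Note that pool.map blocks until every file has been processed and returns the results in submission order. If you want to react to each file as soon as its worker finishes (for example, to log progress or kick off cleanup early), Pool.imap_unordered yields results in completion order instead. A minimal sketch, reusing the same worker function:

```python
import multiprocessing

def process_big_file(big_file):
    # Same dummy worker as above.
    return "done with {0}".format(big_file)

if __name__ == "__main__":
    pool = multiprocessing.Pool(5)
    files = ["bf{0}".format(i) for i in range(10)]
    # Results arrive as workers finish, not in submission order.
    for result in pool.imap_unordered(process_big_file, files):
        print(result)
    pool.close()
    pool.join()
```

For genuinely long-running tasks like multi-hour cluster jobs, this completion-order streaming tends to be more useful than waiting on one big map call.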
