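I would like your input on the following:
I am about to start working at a new company that runs its processing on a cluster. The existing pipeline is already implemented and works roughly as follows: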
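So far I have implemented a very simple manager program in Python that does this and, once it has finished, takes the next file to copy and repeats the steps for the rest of the work list.
My problem is that I want to extend this program so that it copies 5 (possibly more later) big files at a time, submits them to the cluster, and deletes each one only after its run has finished.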
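While looking for a solution, I saw people mention multithreading or multiprocessing, and worker pools in particular. I have no experience with either yet (but one can learn, right?), and I think a pool would be a viable option here. My question is: how do I set up a pool of 5 workers so that each worker performs a series of tasks and, once done, takes a new big file from the queue and starts over?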
Answer 0 (score: 1)
multiprocessing.Pool is designed for exactly this use case:
    import multiprocessing

    def process_big_file(big_file):
        # Runs inside one of the pool's worker processes.
        print("Process {0}: Got big file {1}".format(multiprocessing.current_process(), big_file))
        return "done with {0}".format(big_file)

    def get_big_file_list():
        return ['bf{0}'.format(i) for i in range(20)]  # Just a dummy list

    if __name__ == "__main__":
        pool = multiprocessing.Pool(5)  # 5 worker processes in the pool
        big_file_list = get_big_file_list()
        # Blocks until every item has been processed; results keep input order.
        results = pool.map(process_big_file, big_file_list)
        pool.close()  # no more tasks will be submitted
        pool.join()   # wait for the worker processes to exit
        print(results)
Output:
Process <Process(PoolWorker-1, started daemon)>: Got big file bf0
Process <Process(PoolWorker-1, started daemon)>: Got big file bf1
Process <Process(PoolWorker-3, started daemon)>: Got big file bf2
Process <Process(PoolWorker-4, started daemon)>: Got big file bf3
Process <Process(PoolWorker-5, started daemon)>: Got big file bf4
Process <Process(PoolWorker-5, started daemon)>: Got big file bf5
Process <Process(PoolWorker-5, started daemon)>: Got big file bf6
Process <Process(PoolWorker-3, started daemon)>: Got big file bf7
Process <Process(PoolWorker-3, started daemon)>: Got big file bf8
Process <Process(PoolWorker-2, started daemon)>: Got big file bf9
Process <Process(PoolWorker-2, started daemon)>: Got big file bf10
Process <Process(PoolWorker-2, started daemon)>: Got big file bf11
Process <Process(PoolWorker-4, started daemon)>: Got big file bf12
Process <Process(PoolWorker-4, started daemon)>: Got big file bf13
Process <Process(PoolWorker-4, started daemon)>: Got big file bf14
Process <Process(PoolWorker-4, started daemon)>: Got big file bf15
Process <Process(PoolWorker-4, started daemon)>: Got big file bf16
Process <Process(PoolWorker-4, started daemon)>: Got big file bf17
Process <Process(PoolWorker-4, started daemon)>: Got big file bf18
Process <Process(PoolWorker-4, started daemon)>: Got big file bf19
['done with bf0', 'done with bf1', 'done with bf2', 'done with bf3', 'done with bf4', 'done with bf5', 'done with bf6', 'done with bf7', 'done with bf8', 'done with bf9', 'done with bf10', 'done with bf11', 'done with bf12', 'done with bf13', 'done with bf14', 'done with bf15', 'done with bf16', 'done with bf17', 'done with bf18', 'done with bf19']
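The pool.map call uses an internal queue to distribute all the items in big_file_list to the workers in the pool. As soon as a worker finishes a task, it pulls the next item off the queue, and it keeps going until the queue is empty.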
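To tie this back to the question, here is a minimal sketch of a worker that performs the whole series of steps (copy, run, delete) for one big file before taking the next one. Several things in it are assumptions, not part of the original setup: `run_job` is a hypothetical stand-in for submitting to the cluster and waiting for completion, `handle_big_file` and `process_all` are illustrative names, and the files are small temporary dummies. It also swaps in `multiprocessing.pool.ThreadPool`, which offers the same `map`/`imap_unordered` interface as `multiprocessing.Pool`, on the assumption that the workers mostly wait on file copies and on the cluster rather than doing CPU-heavy work.

```python
import os
import shutil
import tempfile
from multiprocessing.pool import ThreadPool


def run_job(local_path):
    # Hypothetical placeholder for "submit to the cluster and wait".
    # Here it only reports the size of the copied file.
    return os.path.getsize(local_path)


def handle_big_file(task):
    # One worker runs all steps for a single file, in order.
    source_path, work_dir = task
    name = os.path.basename(source_path)
    local_path = os.path.join(work_dir, name)
    shutil.copy(source_path, local_path)   # 1. copy the big file
    try:
        result = run_job(local_path)       # 2. run the job on the copy
    finally:
        os.remove(local_path)              # 3. delete the copy afterwards
    return name, result


def process_all(source_paths, work_dir, workers=5):
    # At most `workers` files are in flight at any time.
    pool = ThreadPool(workers)
    try:
        tasks = [(path, work_dir) for path in source_paths]
        # imap_unordered yields each result as soon as its worker is done.
        return dict(pool.imap_unordered(handle_big_file, tasks))
    finally:
        pool.close()
        pool.join()


if __name__ == "__main__":
    source_dir = tempfile.mkdtemp()
    work_dir = tempfile.mkdtemp()
    paths = []
    for i in range(8):
        path = os.path.join(source_dir, "bf{0}".format(i))
        with open(path, "w") as handle:
            handle.write("x" * (i + 1))  # dummy content
        paths.append(path)
    print(process_all(paths, work_dir))
    shutil.rmtree(source_dir)
    shutil.rmtree(work_dir)
```

Unlike pool.map, imap_unordered hands results back in completion order, which is why the sketch collects them into a dict keyed by file name. If the per-file work turns out to be CPU-bound, replacing ThreadPool with multiprocessing.Pool should need no other changes, since the worker function and its arguments are picklable.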