I need to write code to batch-run about 250,000 input files. I came across this post: https://codereview.stackexchange.com/questions/20416/python-parallelization-using-popen
but I can't figure out how to apply it to my code.
What I want
I want to give the processes a specific number of cores; in other words, only a specific number of processes should be running at any given time.
When one process finishes, another should take its place.
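Conceptually, something like this sketch of the pattern I'm after (illustrative only, not my actual code): keep cp children running and launch a new one as soon as any child exits, reaping whichever finishes first with os.wait.

import os
import subprocess

def run_all(files, cp):
    running = {}  # pid -> input file of each live child
    for path in files:
        if len(running) >= cp:       # all slots busy: block until one exits
            pid, status = os.wait()  # reap whichever child finishes first
            running.pop(pid, None)
        p = subprocess.Popen(["sh", "Child.sh", path, str(cp)])
        running[p.pid] = path        # refill the freed slot right away
    while running:                   # drain the last cp children
        pid, status = os.wait()
        running.pop(pid, None)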
My code (using subprocess)
Main.py
import subprocess
import os
import multiprocessing

MAXCPU = multiprocessing.cpu_count()

try:
    cp = int(raw_input("Enter number of CPUs to use (total %d) = " % MAXCPU))
    assert cp <= MAXCPU
except:
    print "Bad input, taking all %d cores" % MAXCPU
    cp = MAXCPU

list_pdb = [i for i in os.listdir(".") if i.endswith(".pdb")]  # input PDB files
assert len(list_pdb) != 0

c = {}  # pid -> Popen handle of each running child
devnull = file("Devnull", "wb")  # note: a real file named "Devnull", not os.devnull

for each in range(0, len(list_pdb), cp):  # process the files in batches of cp
    for e in range(cp):
        if each + e < len(list_pdb):
            args = ["sh", "Child.sh", list_pdb[each + e], str(cp)]
            p = subprocess.Popen(args, shell=False,
                                 stdout=devnull, stderr=devnull)
            c[p.pid] = p
            print "Started process: %s" % list_pdb[each + e]
    while c:  # wait for the entire batch before starting the next one
        print c.keys()  # pids still running
        pid, status = os.wait()
        if pid in c:
            print "Ended process"
            del c[pid]

devnull.close()
Child.sh
#!/bin/sh
sh grand_Child.sh
sh grand_Child.sh
sh grand_Child.sh
sh grand_Child.sh
# Some heavy processes with $1
grand_Child.sh
#!/bin/sh
sleep 5
Answer (score: 2)
Here is a version of your code using multiprocessing.Pool. It's much simpler, because the module does almost all the work!
This version also:
- prints when each process starts and ends
- prints how many files will be processed
- lets you run more processes than you have CPUs
When running multiprocess jobs, it's often best to run more processes than CPUs: while some processes are waiting on I/O, others can use the CPU. A common rule of thumb is 2n+1, so on a 4-CPU system you'd run 2*4+1 = 9 processes for a job. (I usually just hardcode 5 or 10 until there's a reason to change it; I'm lazy that way :))
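For example (my addition, not part of the original answer), sizing the pool with that 2n+1 rule of thumb would look like:

import multiprocessing

# oversubscribe: 2n+1 workers for n CPUs, so I/O-bound workers
# don't leave cores idle (heuristic discussed above)
pool = multiprocessing.Pool(2 * multiprocessing.cpu_count() + 1)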
Enjoy!
import glob
import multiprocessing
import os
import subprocess

MAXCPU = multiprocessing.cpu_count()
TEST = False

def do_work(args):
    path, numproc = args
    curproc = multiprocessing.current_process()
    print curproc, "Started Process, args={}".format(args)
    devnull = open(os.devnull, 'w')
    cmd = ["sh", "Child.sh", path, str(numproc)]
    if TEST:
        cmd.insert(0, 'echo')
    try:
        return subprocess.check_output(
            cmd, shell=False,
            stderr=devnull,
        )
    finally:
        print curproc, "Ended Process"

if TEST:
    cp = MAXCPU
    list_pdb = glob.glob('t*.py')
else:
    cp = int(raw_input("Enter Number of processes to use (%d CPUs) = " % MAXCPU))
    list_pdb = glob.glob('*.pdb')  # input PDB files
    # assert cp <= MAXCPU

print '{} files, {} procs'.format(len(list_pdb), cp)
assert len(list_pdb) != 0

pool = multiprocessing.Pool(cp)
print pool.map(
    do_work, [(path, cp) for path in list_pdb],
)
pool.close()
pool.join()
Sample output:
27 files, 4 procs
<Process(PoolWorker-2, started daemon)> Started Process, args=('tdownload.py', 4)
<Process(PoolWorker-2, started daemon)> Ended Process
<Process(PoolWorker-2, started daemon)> Started Process, args=('tscapy.py', 4)
<Process(PoolWorker-2, started daemon)> Ended Process
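One caveat for the ~250,000-file case from the question (my addition, not in the original answer): pool.map builds the complete list of return values in memory and only returns once every file is done. At that scale, Pool.imap_unordered with a chunksize streams results back as workers finish, for example:

# sketch reusing do_work and list_pdb from the code above
pool = multiprocessing.Pool(cp)
for out in pool.imap_unordered(do_work,
                               [(path, cp) for path in list_pdb],
                               chunksize=100):
    pass  # handle each result as it arrives instead of collecting all of them
pool.close()
pool.join()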