Running Python processes in parallel on a batch (using a multiprocessing pool)

Date: 2014-08-06 12:28:45

Tags: python shell optimization multiprocessing

I need to write code that runs a batch job over about 250,000 input files. I came across this post: https://codereview.stackexchange.com/questions/20416/python-parallelization-using-popen

I can't figure out how to implement this in my code.

What I want

  • I want to give the job a specific number of cores; in other words, only a fixed number of processes should be running at any given time.

  • When one process finishes, another should take its place (see the sketch after this list).
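
This is exactly the behavior a fixed-size multiprocessing.Pool provides (the answer below uses it). A minimal sketch, where handle_file is a hypothetical stand-in for the per-file work:

import multiprocessing

def handle_file(path):
    # stand-in for the real per-file processing
    return path

if __name__ == '__main__':
    pool = multiprocessing.Pool(4)             # at most 4 workers run at once
    pool.map(handle_file, ['a.pdb', 'b.pdb'])  # a freed worker immediately picks up the next file
    pool.close()
    pool.join()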

My code (using subprocess)

Main.py

import subprocess
import os
import multiprocessing
MAXCPU = multiprocessing.cpu_count()

try:
    cp = int(raw_input("Enter number of CPUs to use (total %d) = " % MAXCPU))
    assert cp <= MAXCPU
except:
    print "Bad input, using all %d cores" % MAXCPU
    cp = MAXCPU

list_pdb = [i for i in os.listdir(".") if i.endswith(".pdb")]  # Input PDB files
assert len(list_pdb) != 0

c = {}
d = {}
t = {}

devnull = open(os.devnull, "wb")  # discard the children's output
for each in range(0, len(list_pdb), cp):   # process the files in batches of cp
    for e in range(cp):
        if each + e < len(list_pdb):
            args = ["sh", "Child.sh", list_pdb[each + e], str(cp)]
            p = subprocess.Popen(args, shell=False,
                stdout=devnull, stderr=devnull)
            c[p.pid] = p
            print "Started Process : %s" % list_pdb[each + e]
    while c:
        print c.keys()
        pid, status = os.wait()
        if pid in c:
            print "Ended Process"
            del c[pid]
devnull.close()

Child.sh

#!/bin/sh
sh grand_Child.sh
sh grand_Child.sh
sh grand_Child.sh
sh grand_Child.sh
# Some heavy processes with $1

grand_Child.sh

#!/bin/sh
sleep 5

Output

(screenshot of the console output, omitted)

1 Answer

Answer (score: 2)

Here's a version of the code using multiprocessing.Pool. It's much simpler, because that module does almost all the work!

This version also:

  • logs when each process starts and ends

  • prints how many files will be processed

  • lets you run more processes than the number of CPUs at a time

Generally, when running multiprocess jobs, it's best to run more processes than CPUs: different processes will wait on I/O while others use the CPU. People often run 2n+1 processes, so on a 4-core system that's 2*4+1 = 9 processes for a job. (I usually just hardcode '5' or '10' until there's a reason to change it; I'm lazy that way :))
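
As a concrete illustration of that rule of thumb (a sketch added here, not part of the original answer):

import multiprocessing

n = multiprocessing.cpu_count()
pool = multiprocessing.Pool(2 * n + 1)  # e.g. 9 workers on a 4-core machine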

Enjoy!

import glob
import multiprocessing
import os
import subprocess

MAXCPU = multiprocessing.cpu_count()
TEST = False

def do_work(args):
    path,numproc = args
    curproc = multiprocessing.current_process()
    print curproc, "Started Process, args={}".format(args)
    devnull = open(os.devnull, 'w')
    cmd = ["sh", "Child.sh", path, str(numproc)]
    if TEST:
        cmd.insert(0, 'echo')
    try:
        return subprocess.check_output(
            cmd, shell=False,
            stderr=devnull,
        )
    finally:
        devnull.close()  # close the handle this call opened
        print curproc, "Ended Process"

if TEST:
    cp = MAXCPU
    list_pdb = glob.glob('t*.py')
else:
    cp = int(raw_input("Enter Number of processes to use (%d CPUs) = " % MAXCPU))
    list_pdb = glob.glob('*.pdb') # Input PDB files
# assert cp <= MAXCPU

print '{} files, {} procs'.format(len(list_pdb), cp)
assert len(list_pdb) != 0

pool = multiprocessing.Pool(cp)
print pool.map(  # blocks until every file has been processed
    do_work, [ (path,cp) for path in list_pdb ],
)
pool.close()
pool.join()

Output

27 files, 4 procs
<Process(PoolWorker-2, started daemon)> Started Process, args=('tdownload.py', 4)
<Process(PoolWorker-2, started daemon)> Ended Process
<Process(PoolWorker-2, started daemon)> Started Process, args=('tscapy.py', 4)
<Process(PoolWorker-2, started daemon)> Ended Process
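
One closing note (an addition, not part of the original answer): with ~250,000 files, pool.map holds every result in memory and only returns after the last file finishes. If you want progress as results come in, Pool.imap_unordered yields each result as soon as its worker completes. A minimal sketch reusing do_work, cp, and list_pdb from above:

pool = multiprocessing.Pool(cp)
done = 0
# imap_unordered yields results in completion order, not submission order
for output in pool.imap_unordered(do_work, [(path, cp) for path in list_pdb]):
    done += 1
    print "Finished %d/%d" % (done, len(list_pdb))
pool.close()
pool.join()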