Question

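I'm using multiprocessing pools to manage tesseract processes (OCRing pages of microfilm). Very often, in a pool of, say, 20 tesseract processes, a few pages are harder to OCR, so those processes take much longer than the others. In the meantime the pool just hangs and most of the CPUs sit unused. I want these stragglers to keep running, but I also want to start more processes to fill the many CPUs that are idle while those few sticky pages finish. My question: is there a way to load new processes to use those idle CPUs? In other words, can the empty spots in the pool be filled before the whole pool has completed?

I could use the async version of starmap and then load a new pool once the current pool has dropped to a certain number of living processes. But that seems inelegant. It would be more elegant to have free slots refilled automatically as needed.

This is what my code looks like right now: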

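import logging
import os
import subprocess
from multiprocessing import Pool
from string import Template

def getMpBatchMap(fileList, commandTemplate, concurrentProcesses):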
    mpBatchMap = []
    for i in range(concurrentProcesses):
        fileName = fileList.readline()
        if fileName:
            mpBatchMap.append((fileName, commandTemplate))
    return mpBatchMap

def executeSystemProcesses(objFileName, commandTemplate):
    objFileName = objFileName.strip()
    logging.debug(objFileName)
    objDirName = os.path.dirname(objFileName)
    command = commandTemplate.substitute(objFileName=objFileName, objDirName=objDirName)
    logging.debug(command)
    subprocess.call(command, shell=True)

def process(FILE_LIST_FILENAME, commandTemplateString, concurrentProcesses=3):
    """Go through the list of files and run the provided command against them,
    one at a time. Template string maps the terms $objFileName and $objDirName.

    Example:
    >>> runBatchProcess('convert -scale 256 "$objFileName" "$objDirName/TN.jpg"')
    """
    commandTemplate = Template(commandTemplateString)
    with open(FILE_LIST_FILENAME) as fileList:
        while 1:
            # Get a batch of x files to process
            mpBatchMap = getMpBatchMap(fileList, commandTemplate, concurrentProcesses)
            # Process them
            logging.debug('Starting MP batch of %i' % len(mpBatchMap))
            if mpBatchMap:
                with Pool(concurrentProcesses) as p:
                    poolResult = p.starmap(executeSystemProcesses, mpBatchMap)
                    logging.debug('Pool result: %s' % str(poolResult))
            else:
                break

Answer 1

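You are mixing something up here. The pool always keeps the specified number of processes alive. As long as you don't close the pool, either manually or by leaving the with-block of the context manager, there is no need to refill the pool with processes, because they aren't going anywhere.

What you probably mean is "tasks", the units of work these processes can pick up. A task is a per-process chunk of the iterable you pass to the pool methods. And yes, there is a way to use idle processes in the pool for new tasks before all previously enqueued tasks have been processed. You already picked the right tool for this: the async versions of the pool methods. All you have to do is call one of the async pool methods again.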

from multiprocessing import Pool
import os

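def busy_foo(x):
    """Burn CPU time roughly proportional to x, then return x."""
    x = int(x)
    for _ in range(x):
        x - 1  # result is discarded; the loop only keeps the CPU busy
    print(os.getpid(), ' returning: ', x)
    return x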

if __name__ == '__main__':

    arguments1 = zip([222e6, 22e6] * 2)
    arguments2 = zip([111e6, 11e6] * 2)

    with Pool(4) as pool:

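        # Both calls return immediately; the second batch is queued
        # behind the first without waiting for it to finish.
        results = pool.starmap_async(busy_foo, arguments1)
        results2 = pool.starmap_async(busy_foo, arguments2)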

        print(results.get())
        print(results2.get())

Example output:

3182  returning:  22000000
3185  returning:  22000000
3185  returning:  11000000
3182  returning:  111000000
3182  returning:  11000000
3185  returning:  111000000
3181  returning:  222000000
3184  returning:  222000000
[222000000, 22000000, 222000000, 22000000]
[111000000, 11000000, 111000000, 11000000]

Process finished with exit code 0

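Note that processes 3182 and 3185, which end up with the easier tasks, immediately start on tasks from the second argument list, without waiting for 3181 and 3184 to complete first.

If, for some reason, you really do want fresh processes after each process has handled a certain number of tasks, there is the maxtasksperchild parameter of Pool. It specifies after how many tasks the pool should replace an old process with a new one. The default for this argument is None, so by default the pool does not replace processes.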
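Applied to the batch runner from the question, the same idea could look roughly like the sketch below. Instead of opening a new pool for every batch, all commands go to one pool as individual tasks, so a worker that finishes an easy page picks up the next file right away. This is only a sketch under assumptions, not the asker's actual code: build_commands, run_command and process_all are names made up for illustration, the demo commands are stand-ins for tesseract calls, and imap_unordered with chunksize=1 is just one of the pool methods that would work here.

```python
import os
import subprocess
import sys
from multiprocessing import Pool
from string import Template


def build_commands(file_names, template_string):
    """Turn file names into shell commands (hypothetical helper)."""
    template = Template(template_string)
    commands = []
    for name in file_names:
        name = name.strip()
        if name:  # skip blank lines
            commands.append(template.substitute(
                objFileName=name, objDirName=os.path.dirname(name)))
    return commands


def run_command(command):
    """Run one shell command; return the command and its exit code."""
    return command, subprocess.call(command, shell=True)


def process_all(commands, concurrent_processes=3):
    """Feed every command to a single pool, one command per task."""
    with Pool(concurrent_processes) as pool:
        # Results arrive in completion order, so one slow page does not
        # hold back the others. Consume them before the pool is torn down.
        return list(pool.imap_unordered(run_command, commands, chunksize=1))


if __name__ == '__main__':
    # Stand-in commands (assumption): one slow "page" and several fast ones.
    sleep = '"{}" -c "import time; time.sleep({})"'
    commands = ([sleep.format(sys.executable, 1)]
                + [sleep.format(sys.executable, 0.1)] * 5)
    for command, exit_code in process_all(commands, concurrent_processes=2):
        print(exit_code, command)
```

With two workers, the fast commands should all be handled by the second worker while the first is still busy with the slow one, which is the behaviour the question asks for.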
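To see the effect of maxtasksperchild, a small experiment like the following can help. It is a sketch with made-up helper names (report_pid, distinct_worker_pids): a one-worker pool runs a few trivial tasks and reports which process IDs handled them. It uses chunksize=1 on the assumption that each item should count as one task for the worker.

```python
import os
from multiprocessing import Pool


def report_pid(_):
    """Return the PID of the worker process that handled this task."""
    return os.getpid()


def distinct_worker_pids(task_count, maxtasksperchild=None):
    """Run task_count tasks on a one-worker pool; return the PIDs seen."""
    with Pool(1, maxtasksperchild=maxtasksperchild) as pool:
        # chunksize=1 so that every item is its own task for the worker.
        return set(pool.map(report_pid, range(task_count), chunksize=1))


if __name__ == '__main__':
    # Default: the single worker is reused for every task.
    print(len(distinct_worker_pids(4)))
    # maxtasksperchild=1: the worker is replaced after each task,
    # so more than one PID should show up.
    print(len(distinct_worker_pids(4, maxtasksperchild=1)))
```

Replacing workers costs process start-up time, so this is mainly useful when workers leak memory or other resources over many tasks.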

Is there a way to use a Python multiprocessing pool to start new processes as old ones finish?
