Python multiprocessing deadlock on early termination

Asked: 2017-05-10 18:31:44

Tags: python python-multiprocessing

I am creating a multiprocessing.Queue in Python and passing it to a number of multiprocessing.Process instances.

I would like to add a function call executed after each job that checks whether a specific task has succeeded. If it has, I want to empty the Queue and terminate execution.

My Process class is:

import multiprocessing

import mbkit.dispatch.cexectools


class Worker(multiprocessing.Process):

    def __init__(self, queue, check_success=None, directory=None, permit_nonzero=False):
        super(Worker, self).__init__()
        self.check_success = check_success
        self.directory = directory
        self.permit_nonzero = permit_nonzero
        self.queue = queue

    def run(self):
        for job in iter(self.queue.get, None):
            stdout = mbkit.dispatch.cexectools.cexec([job], directory=self.directory, permit_nonzero=self.permit_nonzero)
            with open(job.rsplit('.', 1)[0] + '.log', 'w') as f_out:
                f_out.write(stdout)
            if callable(self.check_success) and self.check_success(job):
                # Terminate all remaining jobs here
                pass

My Queue is set up here:

class LocalJobServer(object):

    @staticmethod
    def sub(command, check_success=None, directory=None, nproc=1, permit_nonzero=False, time=None, *args, **kwargs):
        if check_success and not callable(check_success):
            msg = "check_success option requires a callable function/object: {0}".format(check_success)
            raise ValueError(msg)

        # Create a new queue
        queue = multiprocessing.Queue()
        # Create nproc workers to consume the queue
        workers = []
        for _ in range(nproc):
            wp = Worker(queue, check_success=check_success, directory=directory, permit_nonzero=permit_nonzero)
            wp.start()
            workers.append(wp)
        # Add each command to the queue
        for cmd in command:
            queue.put(cmd, timeout=time)
        # Add one None sentinel per worker so each exits once the queue is drained
        for _ in range(nproc):
            queue.put(None)
        for wp in workers:
            wp.join()

The function mbkit.dispatch.cexectools.cexec() is a wrapper around subprocess.Popen and returns p.stdout.
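For reference, a minimal sketch of what such a wrapper might look like (the real mbkit implementation may well differ):

import subprocess

def cexec(cmd, directory=None, permit_nonzero=False):
    # Run the command, capture its stdout, and optionally tolerate non-zero exit codes
    p = subprocess.Popen(cmd, cwd=directory, stdout=subprocess.PIPE,
                         stderr=subprocess.STDOUT, universal_newlines=True)
    stdout, _ = p.communicate()
    if p.returncode != 0 and not permit_nonzero:
        raise RuntimeError("{0} failed with return code {1}".format(cmd, p.returncode))
    return stdout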

In the Worker class, I have written a conditional to check whether a job was successful, and attempted to empty the remaining jobs in the Queue with a while loop, i.e. my Worker.run() function looks like this:

def run(self):
    for job in iter(self.queue.get, None):
        stdout = mbkit.dispatch.cexectools.cexec([job], directory=self.directory, permit_nonzero=self.permit_nonzero)
        with open(job.rsplit('.', 1)[0] + '.log', 'w') as f_out:
            f_out.write(stdout)
        if callable(self.check_success) and self.check_success(job):
            break
    while not self.queue.empty():
        self.queue.get()

Although this works sometimes, it more often than not deadlocks, and my only option is Ctrl-C. I am aware that Queue.empty() is unreliable, hence my question.

Any suggestions on how I could implement this early-termination functionality?

2 answers:

Answer 0 (score: 1)

You do not have a deadlock here. It is just linked to the behaviour of multiprocessing.Queue: the get method blocks by default, so when you call get on an empty queue, the call stalls, waiting for the next element to become available. Some of your workers will therefore stall: when you empty the queue with your while not self.queue.empty() loop, you also remove all the None sentinels, and some of your workers then block forever on the empty Queue, just as in this code:

from multiprocessing import Queue
q = Queue()
for e in iter(q.get, None):   # blocks forever: the queue is empty and no sentinel ever arrives
    print(e)

To be notified when the queue is empty, you need to use a non-blocking call. You can for instance use q.get_nowait, or pass a timeout with q.get(timeout=1). Both raise a multiprocessing.queues.Empty exception when the queue is empty. So you should replace your Worker's for job in iter(...): loop with something like:

while not queue.empty():
    try:
        job = queue.get(timeout=.1)
    except multiprocessing.queues.Empty:
        continue
    # Do stuff with your job

This way, the worker never stays blocked at any point.
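The q.get_nowait variant mentioned above works the same way, raising the same exception immediately instead of after the timeout:

while not queue.empty():
    try:
        job = queue.get_nowait()
    except multiprocessing.queues.Empty:
        continue
    # Do stuff with your job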

For the synchronisation part, I would recommend using a synchronisation primitive such as multiprocessing.Condition or multiprocessing.Event. This is cleaner than a Value, as these primitives are designed for exactly this purpose. Something like this should help:

def run(self):
    while not self.queue.empty():
        try:
            job = self.queue.get(timeout=.1)
        except multiprocessing.queues.Empty:
            continue
        if self.event.is_set():
            # A job has already succeeded; drain the remaining jobs without running them
            continue
        stdout = mbkit.dispatch.cexectools.cexec([job], directory=self.directory, permit_nonzero=self.permit_nonzero)
        with open(job.rsplit('.', 1)[0] + '.log', 'w') as f_out:
            f_out.write(stdout)
        if callable(self.check_success) and self.check_success(job):
            self.event.set()
    print("Worker {} terminated cleanly".format(self.name))

with the event created once in sub and shared by every Worker (stored as self.event):

event = multiprocessing.Event()
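A sketch of how sub could create and distribute that event (assuming Worker.__init__ accepts the event and stores it as self.event):

def sub(command, check_success=None, directory=None, nproc=1, permit_nonzero=False):
    queue = multiprocessing.Queue()
    event = multiprocessing.Event()
    # Queue the jobs before starting the workers, so a worker's
    # while-not-empty loop cannot exit before any job has arrived
    for cmd in command:
        queue.put(cmd)
    # No None sentinels are needed: the workers now stop on an empty queue
    workers = []
    for _ in range(nproc):
        wp = Worker(queue, event, check_success=check_success,
                    directory=directory, permit_nonzero=permit_nonzero)
        wp.start()
        workers.append(wp)
    for wp in workers:
        wp.join()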

Note that it is also possible to use a multiprocessing.Pool to avoid dealing with the queue and the workers yourself. However, as you need a synchronisation primitive, the setup might be slightly more complicated. Something like this should work:

def worker(job, success, check_success=None, directory=None, permit_nonzero=False):
    if success.is_set():
        return False
    stdout = mbkit.dispatch.cexectools.cexec([job], directory=directory, permit_nonzero=permit_nonzero)
    with open(job.rsplit('.', 1)[0] + '.log', 'w') as f_out:
        f_out.write(stdout)
    if callable(check_success) and check_success(job):
        success.set()
    return True

# ......
# In the class LocalJobServer
# .....

def sub(command, check_success=None, directory=None, nproc=1, permit_nonzero=False):

    mgr = multiprocessing.Manager()
    success = mgr.Event()

    pool = multiprocessing.Pool(nproc)
    run_args = [(cmd, success, check_success, directory, permit_nonzero)
                for cmd in command]
    result = pool.starmap(worker, run_args)

    pool.close()
    pool.join()

Note that I use a Manager here because you cannot pass a multiprocessing.Event directly as a task argument to the pool's workers. You could also use the initializer and initargs arguments of Pool to instantiate a global success event in each worker process and avoid relying on the Manager, but it is slightly more complicated.
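A sketch of that initializer-based variant (illustrative only; _success and init_worker are made-up names, and pool.starmap assumes Python 3):

import multiprocessing

_success = None  # per-process global, set once by the initializer

def init_worker(event):
    # Runs once in every pool process; stores the shared event globally
    global _success
    _success = event

def worker(job, check_success=None, directory=None, permit_nonzero=False):
    if _success.is_set():
        return False
    stdout = mbkit.dispatch.cexectools.cexec([job], directory=directory, permit_nonzero=permit_nonzero)
    with open(job.rsplit('.', 1)[0] + '.log', 'w') as f_out:
        f_out.write(stdout)
    if callable(check_success) and check_success(job):
        _success.set()
    return True

def sub(command, check_success=None, directory=None, nproc=1, permit_nonzero=False):
    # The event is passed through initargs at process creation time,
    # which is allowed, unlike passing it in each task's arguments
    success = multiprocessing.Event()
    pool = multiprocessing.Pool(nproc, initializer=init_worker, initargs=(success,))
    run_args = [(cmd, check_success, directory, permit_nonzero) for cmd in command]
    pool.starmap(worker, run_args)
    pool.close()
    pool.join()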

Answer 1 (score: 0)

This might not be the optimal solution, and any other suggestions are much appreciated, but I managed to solve the problem like this:

import multiprocessing
import time

import mbkit.dispatch.cexectools


class Worker(multiprocessing.Process):
    """Simple manual worker class to execute jobs in the queue"""

    def __init__(self, queue, success, check_success=None, directory=None, permit_nonzero=False):
        super(Worker, self).__init__()
        self.check_success = check_success
        self.directory = directory
        self.permit_nonzero = permit_nonzero
        self.success = success
        self.queue = queue

    def run(self):
        """Method representing the process's activity"""
        for job in iter(self.queue.get, None):
            if self.success.value:
                continue
            stdout = mbkit.dispatch.cexectools.cexec([job], directory=self.directory, permit_nonzero=self.permit_nonzero)
            with open(job.rsplit('.', 1)[0] + '.log', 'w') as f_out:
                f_out.write(stdout)
            if callable(self.check_success) and self.check_success(job):
                self.success.value = int(True)
            time.sleep(1)


class LocalJobServer(object):
    """A local server to execute jobs via the multiprocessing module"""

    @staticmethod
    def sub(command, check_success=None, directory=None, nproc=1, permit_nonzero=False, time=None, *args, **kwargs):
        if check_success and not callable(check_success):
            msg = "check_success option requires a callable function/object: {0}".format(check_success)
            raise ValueError(msg)

        # Create a new queue
        queue = multiprocessing.Queue()
        success = multiprocessing.Value('i', int(False))
        # Create nproc workers to consume the queue
        workers = []
        for _ in range(nproc):
            wp = Worker(queue, success, check_success=check_success, directory=directory, permit_nonzero=permit_nonzero)
            wp.start()
            workers.append(wp)
        # Add each command to the queue
        for cmd in command:
            queue.put(cmd)
        # Add one None sentinel per worker so each exits once the queue is drained
        for _ in range(nproc):
            queue.put(None)
        # Start the workers
        for wp in workers:
            wp.join(time)

Basically I am creating a shared Value and providing that to each Process. Once a job is marked as successful, this variable gets updated. Each Process checks in if self.success.value: continue whether we already have a success and, if so, just iterates over the remaining jobs in the Queue until it is empty.

The time.sleep(1) call is required to account for potential synchronisation delays between the processes. This is certainly not the most efficient approach, but it works.
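For illustration, a hypothetical invocation of LocalJobServer.sub might look like this (the script names and the success check are made up):

def succeeded(job):
    # Hypothetical check: a job counts as successful if its log mentions SUCCESS
    with open(job.rsplit('.', 1)[0] + '.log') as f_in:
        return 'SUCCESS' in f_in.read()

LocalJobServer.sub(['job1.sh', 'job2.sh', 'job3.sh'], check_success=succeeded, nproc=2)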