Multiprocessing: cancel the remaining jobs in a pool without destroying the Pool

Question

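I am using `map_async` to create a pool of 4 workers and giving it a list of image files to process [Set 1].
At times, I need to cancel the processing partway through, so that I can get a different set of files processed instead [Set 2].

As an example, I give `map_async` 1000 files to process, and then want to cancel the processing of the remaining jobs after about 200 files have been processed.
Additionally, I want to do this cancellation without destroying/terminating the pool. Is this possible?

I do not want to terminate the pool, because recreating the pool is a slow process on Windows (it uses "spawn" instead of "fork"). And I need to use the same pool to process a different set of image files [Set 2].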

import multiprocessing
from multiprocessing import Pool

# Putting job_set1 through processing. It may consist of 1000 images
cpu = multiprocessing.cpu_count()
pool = Pool(processes=cpu)
result = pool.map_async(job_set1, thumb_ts_list, chunksize=chunksize)

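Now, partway through, I need to cancel the processing of Set 1 and move on to a different set (waiting for all 1000 images to finish is not an option, but I can wait for the images currently being processed to finish).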

<Somehow cancel processing of job_set1>
result = pool.map_async(job_set2, thumb_ts_list, chunksize=chunksize)

Answer 1

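It's time for the fundamental theorem of software engineering: while `multiprocessing.Pool` doesn't supply cancellation as a feature, we can add it by having the `Pool` read from a carefully crafted iterable. It's not enough, however, to have a generator that yields values from a list but stops short on some signal, because the `Pool` eagerly drains any generator given to it. So we need a very carefully crafted iterable indeed.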
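The eager draining is easy to observe. The following is only a sketch under assumptions: it uses `multiprocessing.pool.ThreadPool` as a stand-in for a process `Pool` (it has the same task-feeding thread, and avoids pickling so the helper functions can be nested), and `pulled_before_first_result` is a made-up name. It counts how many items have been pulled from a generator by the time the first result arrives.

```python
import time
from multiprocessing.pool import ThreadPool  # assumption: stand-in for multiprocessing.Pool


def pulled_before_first_result(n_items=50, workers=2, delay=0.05):
    """Return how many inputs the pool had consumed when the first result arrived."""
    pulled = 0

    def source():
        nonlocal pulled
        for i in range(n_items):
            pulled += 1          # record every item the pool asks for
            yield i

    def slow(i):
        time.sleep(delay)        # simulate a task that takes a while
        return i

    with ThreadPool(workers) as pool:
        results = pool.imap_unordered(slow, source())
        next(results)            # wait for the first finished task
        snapshot = pulled        # how far ahead has the pool read?
    return snapshot
```

With only two workers, the count is typically far larger than two, which is why a generator that merely "stops on a signal" reacts too late: most of its items have already been handed to the pool.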

A lazy `Pool`

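The generic tool we need is a way to construct tasks for a `Pool` only when a worker becomes available (or at most one task ahead, in case constructing them takes significant time). The basic idea is to slow down the thread that collects work for the `Pool` with a semaphore that is upped only when a task finishes. (We know such a thread exists from the observable behavior of `imap_unordered`.)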

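import multiprocessing
import sys
import traceback
from threading import Semaphore

size = multiprocessing.cpu_count()  # or whatever Pool size to use

# How many workers are waiting for work?  Add one to buffer one task.
work = Semaphore(size)

def feed0(it):
    """Yield items from `it`, but only once a worker slot is free."""
    it = iter(it)
    try:
        while True:
            # Don't ask the iterable until we have a customer, in case better
            # instructions become available:
            work.acquire()
            yield next(it)
    except StopIteration:
        pass
    work.release()  # give back the slot acquired for the item that never came

def feed(p, f, it):
    """Yield the results of f over `it` from Pool `p`, feeding it lazily."""
    iu = p.imap_unordered(f, feed0(it))
    while True:
        try:
            x = next(iu)
        except StopIteration:
            return
        except Exception:
            # A task failed in a worker: report it, free its slot, and move on
            # (there is no result to yield for this task).
            traceback.print_exception(*sys.exc_info())
            work.release()
            continue
        work.release()
        yield x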

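The `try` in `feed` prevents failures in the children from breaking the semaphore's count, but note that it does not protect against failures in the parent.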
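If a module-level semaphore is inconvenient (for example with more than one feed in flight), the same idea can be packaged per pool. This is a sketch, not the answer's code: `LazyFeed` and `count_lookahead` are hypothetical names, `ThreadPool` again stands in for a process `Pool`, and unlike `feed` it lets a worker exception propagate instead of printing it.

```python
import threading
from multiprocessing.pool import ThreadPool  # assumption: stand-in for multiprocessing.Pool


class LazyFeed:
    """Hypothetical wrapper: lets a pool read at most `slots` tasks ahead of its results."""

    def __init__(self, pool, slots):
        self.pool = pool
        self.slots = threading.Semaphore(slots)

    def _throttled(self, iterable):
        it = iter(iterable)
        while True:
            self.slots.acquire()        # wait until a worker slot is free
            try:
                item = next(it)
            except StopIteration:
                self.slots.release()    # hand back the slot we did not use
                return
            yield item

    def run(self, func, iterable):
        for value in self.pool.imap_unordered(func, self._throttled(iterable)):
            self.slots.release()        # one task finished: admit another
            yield value


def count_lookahead(n_items=20, workers=2):
    """Return (items pulled, largest observed lead of inputs over results)."""
    pulled = 0

    def source():
        nonlocal pulled
        for i in range(n_items):
            pulled += 1
            yield i

    max_ahead = 0
    with ThreadPool(workers) as pool:
        feeder = LazyFeed(pool, workers)
        for done, _ in enumerate(feeder.run(abs, source()), start=1):
            max_ahead = max(max_ahead, pulled - done)
    return pulled, max_ahead
```

Here the pool can never have read more than `workers` inputs beyond the results already delivered, so whatever decides the next input gets to decide late.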

A cancellable iterator

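Now that we have real-time control over the `Pool` input, implementing any scheduling policy is straightforward. For example, here is something like `itertools.chain`, but with the ability to asynchronously discard any remaining elements from one of the input sequences: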

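import collections
import queue

class Cancel:
    def __init__(self):
        self.data = queue.Queue()       # of deques
        # Start with an empty deque (a tuple has no popleft), so the first
        # popleft raises IndexError and we fetch the first real deque.
        self.cur = collections.deque()

    def add(self, d):
        d = collections.deque(d)
        self.data.put(d)
        return d                        # handle to pass to cancel() later

    def __iter__(self):
        while True:
            try:
                yield self.cur.popleft()
            except IndexError:
                self.cur = self.data.get()  # blocks until add() or close()
                if self.cur is None:
                    break

    @staticmethod
    def cancel(d):
        d.clear()

    def close(self):
        self.data.put(None)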

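Although it has no locking, this is thread-safe (in CPython at least), because operations like `deque.clear` are atomic as far as Python code can observe (and we don't separately check whether `self.cur` is empty).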
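The "pop and catch `IndexError`" pattern is what makes clearing work as cancellation: there is never a gap between an emptiness check and the pop. A minimal single-threaded sketch (the helper name `drain` is made up for illustration):

```python
import collections


def drain(d):
    """Pop items one at a time; stop as soon as the deque is found empty."""
    while True:
        try:
            yield d.popleft()
        except IndexError:      # empty, whether exhausted or cleared
            return


batch = collections.deque(range(1000))
seen = []
for item in drain(batch):
    seen.append(item)
    if item == 4:
        batch.clear()           # "cancel": the remaining 995 items are dropped

# seen == [0, 1, 2, 3, 4]
```

In `Cancel`, the clearing happens from the consumer's thread while the pool's feeder thread is the one popping, but the logic is the same.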

Usage

Using one of these looks like this:

pool = multiprocessing.Pool(size)
can = Cancel()
many = can.add(range(1000))
few = can.add(["some", "words"])
can.close()
# assess_happiness and happy_with stand for your own task and stopping test
for x in feed(pool, assess_happiness, can):
    if happy_with(x):
        can.cancel(many)  # straight onto few, then out

Of course, the calls to `add` and `close` could themselves be inside the loop.

Answer 2

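The `multiprocessing` module doesn't seem to have a concept of cancellation. You can use the `concurrent.futures.ProcessPoolExecutor` wrapper and cancel the pending futures once you have enough results.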
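What "cancel the pending futures" means in practice: `Future.cancel()` returns `True` only for work that has not started, and the executor stays usable afterwards. The sketch below is an illustration under assumptions: it uses `ThreadPoolExecutor` with a single worker as a stand-in so the outcome is deterministic, and `demo_cancel` is a made-up name. With a process executor, a few queued calls may already be marked as running, so `cancel()` can return `False` for slightly more futures than there are workers.

```python
import threading
from concurrent.futures import ThreadPoolExecutor  # assumption: stand-in for ProcessPoolExecutor


def demo_cancel():
    started = threading.Event()
    release = threading.Event()

    def blocker():
        started.set()       # tell the caller the worker is now busy
        release.wait()      # hold the only worker until released
        return "first"

    with ThreadPoolExecutor(max_workers=1) as executor:
        running = executor.submit(blocker)
        queued = [executor.submit(pow, 2, n) for n in range(5)]
        started.wait()
        cancelled = [f.cancel() for f in queued]   # still pending: cancellable
        running_cancelled = running.cancel()       # already running: not cancellable
        release.set()
        first = running.result()
        later = executor.submit(pow, 2, 10).result()  # executor is still usable
    return cancelled, running_cancelled, first, later
```

The running task is left to finish, which matches the question's requirement of waiting only for the work already in progress.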

Here is an example that picks out 10 JPEGs from a set of paths and cancels the pending futures, while leaving the process pool usable afterwards:

import concurrent.futures


def interesting_path(path):
    """Gives path if is a JPEG else ``None``."""
    with open(path, 'rb') as f:
        if f.read(3) == b'\xff\xd8\xff':
            return path
        return None


def find_interesting(paths, count=10):
    """Yields count from paths which are 'interesting' by multiprocess task."""
    with concurrent.futures.ProcessPoolExecutor() as pool:
        futures = {pool.submit(interesting_path, p) for p in paths}
        print('Started {}'.format(len(futures)))
        for future in concurrent.futures.as_completed(futures):
            res = future.result()
            futures.remove(future)
            if res is not None:
                yield res
                count -= 1
                if count == 0:
                    break
        cancelled = 0
        for future in futures:
            cancelled += future.cancel()
        print('Cancelled {}'.format(cancelled))
        concurrent.futures.wait(futures)
        # Can still use pool here for more processing as needed

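Note that choosing how to break the work up into futures is still tricky: a bigger set of futures means more overhead, but can also mean less wasted work. This can also be adapted to the Python 3.6 async syntax easily enough.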
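One way to expose that trade-off as a single knob is to submit batches instead of single items. The following is only a sketch: `batches`, `keep_even` and `first_matches` are hypothetical helpers, and the executor is passed in so the same code can be tried with either a process or a thread executor. Smaller batches create more futures (more overhead) but leave more of them cancellable.

```python
import concurrent.futures
from itertools import islice


def batches(items, size):
    """Split an iterable into lists of at most `size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def keep_even(batch):
    """Hypothetical per-batch task: return the items of interest in this batch."""
    return [x for x in batch if x % 2 == 0]


def first_matches(executor, task, items, count, batch_size):
    """Collect `count` matches, then cancel whatever batches have not started."""
    futures = {executor.submit(task, b) for b in batches(items, batch_size)}
    found = []
    for future in concurrent.futures.as_completed(futures):
        found.extend(future.result())
        if len(found) >= count:
            break
    for future in futures:
        future.cancel()      # has no effect on finished or running futures
    return found[:count]
```

A call such as `first_matches(executor, keep_even, range(100), 5, 10)` returns five even numbers; which ones depends on the order in which batches complete.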
