Question

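I have 100-1000 time series paths and a fairly expensive simulation that I would like to parallelize. However, the library I am using hangs on rare occasions, so I want to make the setup robust to those failures. This is the current setup: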

with Pool() as pool:
    res = pool.map_async(simulation_that_occasionally_hangs, (p for p in paths))
    all_costs = res.get()

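I know get() has a timeout parameter, but if I understand correctly it applies to the whole run over all 1000 paths. What I would like to do is check whether any single simulation takes longer than 5 minutes (a normal path takes 4 seconds) and, if so, stop that path and continue to get() the rest.

Edit:

Testing the timeout in pebble: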

def fibonacci(n):
    if n == 0: return 0
    elif n == 1: return 1
    else: return fibonacci(n - 1) + fibonacci(n - 2)


def main():
    with ProcessPool() as pool:
        future = pool.map(fibonacci, range(40), timeout=10)
        iterator = future.result()

        all = []
        while True:
            try:
                all.append(next(iterator))
            except StopIteration:
                break
            except TimeoutError as e:
                print(f'function took longer than {e.args[1]} seconds')

        print(all)

Error:

RuntimeError: I/O operations still in flight while destroying Overlapped object, the process may crash
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\anaconda3\lib\multiprocessing\spawn.py", line 99, in spawn_main
    new_handle = reduction.steal_handle(parent_pid, pipe_handle)
  File "C:\anaconda3\lib\multiprocessing\reduction.py", line 87, in steal_handle
    _winapi.DUPLICATE_SAME_ACCESS | _winapi.DUPLICATE_CLOSE_SOURCE)
PermissionError: [WinError 5] Access is denied

Answer 1

The simplest approach is probably to run each heavy simulation in its own subprocess, with the parent process watching over it. Specifically:

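import multiprocessing
from multiprocessing.pool import ThreadPool

TIMEOUT = 300  # seconds; the 5-minute limit from the question

def risky_simulation(path):
    ...

def safe_simulation(path):
    p = multiprocessing.Process(target=risky_simulation, args=(path,))
    p.start()
    p.join(TIMEOUT)  # wait at most TIMEOUT seconds
    if p.is_alive():
        p.kill()  # or p.terminate(); kill() needs Python 3.7+
        p.join()  # reap the killed process
    # Here read and return the output of the simulation
    # Can be from a file, or using some communication object
    # between processes, from the `multiprocessing` module

# Workers of multiprocessing.Pool are daemonic and are not allowed to start
# child processes, so the supervising workers here are threads instead.
if __name__ == "__main__":  # guard needed where child processes re-import this module (e.g. Windows)
    with ThreadPool() as pool:
        res = pool.map_async(safe_simulation, paths)
        all_costs = res.get()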

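Notes:

If the simulation may hang, you probably want to run it in a separate process (i.e. the Process object should not be a thread), because depending on how the hang happens it may hold the GIL.
This solution uses the pool only for the supervising workers, while the computation itself is offloaded to new processes. We could also make the computations share a pool, but that would make the code uglier, so I skipped it.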
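The "read and return the output" step is left open above. One way to fill it in, as a minimal sketch rather than the only option, is to pass the result back through a `multiprocessing.Pipe`. The names `run_with_timeout`, `_child` and `demo_simulation` are hypothetical and exist only for this illustration; `demo_simulation` stands in for the real simulation and pretends to hang on one input.

```python
import multiprocessing
import time


def _child(conn, func, arg):
    """Run func(arg) in the child process and send the outcome to the parent."""
    try:
        conn.send(("ok", func(arg)))
    except Exception as exc:  # report ordinary failures instead of dying silently
        conn.send(("error", repr(exc)))
    finally:
        conn.close()


def run_with_timeout(func, arg, timeout):
    """Return func(arg), or None if it fails, crashes, or exceeds `timeout` seconds."""
    parent_conn, child_conn = multiprocessing.Pipe(duplex=False)
    proc = multiprocessing.Process(target=_child, args=(child_conn, func, arg))
    proc.start()
    child_conn.close()  # the parent keeps only the receiving end
    result = None
    # poll() returns True once data arrives or the child closes its end
    if parent_conn.poll(timeout):
        try:
            status, payload = parent_conn.recv()
            if status == "ok":
                result = payload
        except EOFError:
            pass  # the child died without sending anything (e.g. a hard crash)
    else:
        proc.kill()  # no answer within the time limit: stop the hung process
    proc.join()
    parent_conn.close()
    return result


def demo_simulation(x):
    """Hypothetical stand-in for the real simulation: 'hangs' when x == 2."""
    if x == 2:
        time.sleep(60)
    return x * x


if __name__ == "__main__":
    # The hung input yields None; the other inputs return normally.
    print([run_with_timeout(demo_simulation, x, timeout=2) for x in range(4)])
```

In this sketch a timed-out or crashed path simply yields None, so the caller can filter those entries out afterwards. To run many paths concurrently, `run_with_timeout` could be mapped over the paths from a pool of supervising threads, since each call spends its time waiting on the child process.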

Answer 2

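The pebble library was designed to address exactly this kind of problem. It transparently handles job timeouts and failures (such as C library crashes).

You can check the examples in the documentation to see how to use it. It has an interface similar to concurrent.futures.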

Multiprocessing robustness to occasional failures
