dask, execute non-serializable object on every worker

Date: 2016-11-03 13:27:28

Tags: python multithreading serialization dask

I am trying to execute the following graph:

[graph image]

which was generated by the following code:

    energies = [10, 20]
    system = delayed(make_non_serializable_oject)(x=1)
    trans = [delayed(some_function_that_uses_system)(system, energy)
             for energy in energies]
    result = delayed(list)(trans)
    result.visualize()

When I call result.compute() the calculation never finishes.

Calling result.compute(get=dask.async.get_sync) and result.compute(dask.threaded.get) both work fine. However, result.compute(dask.multiprocessing.get) does not, and generates the following error:

    ---------------------------------------------------------------------------
    RemoteError                               Traceback (most recent call last)
    <ipython-input-70-b5c8f2a1c6f6> in <module>()
    ----> 1 result.compute(get=dask.multiprocessing.get)

    /home/bnijholt/anaconda3/lib/python3.5/site-packages/dask/base.py in compute(self, **kwargs)
         76         Extra keywords to forward to the scheduler ``get`` function.
         77         """
    ---> 78         return compute(self, **kwargs)[0]
         79
         80     @classmethod

    /home/bnijholt/anaconda3/lib/python3.5/site-packages/dask/base.py in compute(*args, **kwargs)
        169     dsk = merge(var.dask for var in variables)
        170     keys = [var._keys() for var in variables]
    --> 171     results = get(dsk, keys, **kwargs)
        172
        173     results_iter = iter(results)

    /home/bnijholt/anaconda3/lib/python3.5/site-packages/dask/multiprocessing.py in get(dsk, keys, num_workers, func_loads, func_dumps, optimize_graph, **kwargs)
         81         # Run
         82         result = get_async(apply_async, len(pool._pool), dsk3, keys,
    ---> 83                            queue=queue, get_id=_process_get_id, **kwargs)
         84     finally:
         85         if cleanup:

    /home/bnijholt/anaconda3/lib/python3.5/site-packages/dask/async.py in get_async(apply_async, num_workers, dsk, result, cache, queue, get_id, raise_on_exception, rerun_exceptions_locally, callbacks, **kwargs)
        479                     _execute_task(task, data)  # Re-execute locally
        480                 else:
    --> 481                     raise(remote_exception(res, tb))
        482             state['cache'][key] = res
        483             finish_task(dsk, key, state, results, keyorder.get)

    RemoteError:
    ---------------------------------------------------------------------------
    Traceback (most recent call last):
      File "/home/bnijholt/anaconda3/lib/python3.5/multiprocessing/managers.py", line 228, in serve_client
        request = recv()
      File "/home/bnijholt/anaconda3/lib/python3.5/multiprocessing/connection.py", line 251, in recv
        return ForkingPickler.loads(buf.getbuffer())
      File "kwant/graph/core.pyx", line 664, in kwant.graph.core.CGraph_malloc.__cinit__ (kwant/graph/core.c:8330)
    TypeError: __cinit__() takes exactly 6 positional arguments (0 given)
    ---------------------------------------------------------------------------

    Traceback
    ---------
      File "/home/bnijholt/anaconda3/lib/python3.5/site-packages/dask/async.py", line 273, in execute_task
        queue.put(result)
      File "<string>", line 2, in put
      File "/home/bnijholt/anaconda3/lib/python3.5/multiprocessing/managers.py", line 732, in _callmethod
        raise convert_to_error(kind, result)

With ipyparallel I would execute make_non_serializable_object on each engine, which solves the problem for that case.

I would like to use dask for my parallel computations. How can I solve this?

1 Answer:

Answer 0 (score: 3)

Ensure that your data can be serialized

This part of your traceback shows that the objects from the kwant library are not serializing themselves well:

Traceback (most recent call last):
  File "/home/bnijholt/anaconda3/lib/python3.5/multiprocessing/managers.py", line 228, in serve_client
    request = recv()
  File "/home/bnijholt/anaconda3/lib/python3.5/multiprocessing/connection.py", line 251, in recv
    return ForkingPickler.loads(buf.getbuffer())
  File "kwant/graph/core.pyx", line 664, in kwant.graph.core.CGraph_malloc.__cinit__ (kwant/graph/core.c:8330)
TypeError: __cinit__() takes exactly 6 positional arguments (0 given)

This is why the multiprocessing and distributed schedulers fail. Dask needs to be able to serialize data in order to move it between different processes.
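As a quick check (not part of the original answer): you can reproduce the failure without dask by pickling the object directly, since pickling is exactly what the multiprocessing scheduler does to move results between processes. A minimal sketch, using a threading.Lock as a stand-in for the unpicklable kwant object:

    import pickle
    import threading

    # stand-in for the question's make_non_serializable_oject: a Lock,
    # like kwant's C-level graph, cannot be pickled
    system = threading.Lock()

    try:
        pickle.dumps(system)   # what dask does before moving data between processes
    except TypeError as err:
        print("cannot serialize:", err)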

The simplest and cleanest way to solve this problem is to improve the serialization of your data. Ideally you can accomplish this by improving kwant. You could also manage this through dask's custom serialization, but that is probably more work at the moment.
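If you control the class, the usual way to improve serialization is to implement the standard pickle hooks, so that only the cheap constructor arguments travel and the expensive unpicklable state is rebuilt on arrival. A hedged sketch; the System class and its fields are made up here and are not kwant's API:

    import pickle
    import threading

    class System:
        """Hypothetical object whose internals (like kwant's C graph) can't be pickled."""
        def __init__(self, x):
            self.x = x
            self._graph = threading.Lock()   # stand-in for unpicklable internal state

        def __getstate__(self):
            # ship only the picklable constructor arguments
            return {'x': self.x}

        def __setstate__(self, state):
            # rebuild the unpicklable internals in the receiving process
            self.__init__(state['x'])

    system = pickle.loads(pickle.dumps(System(x=1)))  # now round-trips cleanly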

Keep data in one location

OK, let's assume that you can't improve serialization and need to keep the data where it is. This will restrict you to embarrassingly parallel workflows (map). There are two solutions:

  1. Use the fuse optimization
  2. Track explicitly where each task runs

Fuse

You make some non-serializable data and run computations on it, eventually turning it into something serializable before trying to move it back. This is fine as long as the scheduler never decides to move the data around on its own. You can enforce this by fusing all of those tasks into a single atomic task. See the optimization docs for details:

    from dask.optimize import fuse
    bad_data = [f(...) for ...]
    good_data = [convert_to_serializable_data(bd) for bd in bad_data]
    dask.compute(good_data, optimizations=[fuse])
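Filled in with dummy functions so that it actually runs, the pattern might look like the following; everything except fuse and the compute call is hypothetical, and the get=/optimizations= keywords follow the 2016-era dask API used in the question:

    import threading
    import dask
    import dask.multiprocessing
    from dask import delayed
    from dask.optimize import fuse   # moved to dask.optimization in later versions

    def make_unpicklable(i):
        return (i, threading.Lock())   # a Lock can't cross process boundaries

    def to_serializable(pair):
        return pair[0] ** 2            # extract a plain, picklable result

    bad_data = [delayed(make_unpicklable)(i) for i in range(4)]
    good_data = [delayed(to_serializable)(bd) for bd in bad_data]

    # fuse collapses each make_unpicklable -> to_serializable chain into one
    # atomic task, so the Lock is created and discarded inside a single worker
    # process and only the final integers are sent back.
    results = dask.compute(good_data, get=dask.multiprocessing.get,
                           optimizations=[fuse])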
    

Specify exactly where every computation should live

See the Data Locality docs.
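With the distributed scheduler, the relevant tool is the workers= restriction on submit, which pins a task to a particular worker. A hedged sketch reusing the question's function names; the scheduler and worker addresses are made up:

    from distributed import Client

    client = Client('127.0.0.1:8786')          # address of a running scheduler

    # build the unserializable system on one specific worker ...
    system = client.submit(make_non_serializable_oject, x=1,
                           workers=['192.168.1.100'])

    # ... and pin every task that touches it to the same worker, so the
    # object itself never has to be serialized; only the results move.
    energies = [10, 20]
    trans = [client.submit(some_function_that_uses_system, system, energy,
                           workers=['192.168.1.100'])
             for energy in energies]

    result = client.gather(trans)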