Question

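I want to perform N = 1000 bootstrap iterations with replacement on gridded data. One computation takes about 0.5 s. I have access to an exclusive supercomputer node with 48 cores. Because the resamplings are independent of each other, I naively hoped to distribute the workload across all, or at least many, cores and get a speedup of about 0.8 * ncores. But I don't get it.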

I still lack a proper understanding of dask. Based on Best practices in setting number of dask workers, I use:

from dask.distributed import Client
client = Client(processes=False, threads_per_worker=8, n_workers=6, memory_limit='32GB')

I also tried SLURMCluster, but I think I first need to understand what I am doing before scaling up.

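My MWE:

- create sample data
- write the function to apply
- write the function that resamples along init
- write the bootstrapping function with the number of iterations (bootstrap = N) as argument: see the several implementations below
- execute the bootstrapping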

import dask
import numpy as np
import xarray as xr
from dask.distributed import Client

inits = np.arange(50)
lats = np.arange(96)
lons = np.arange(192)
data = np.random.rand(len(inits), len(lats), len(lons))
a = xr.DataArray(data,
                        coords=[inits, lats, lons],
                        dims=['init', 'lat', 'lon'])

data = np.random.rand(len(inits), len(lats), len(lons))
b = xr.DataArray(data,
                        coords=[inits, lats, lons],
                        dims=['init', 'lat', 'lon'])

def func(a,b, dim='init'):
    return (a-b).std(dim)

bootstrap=96

def resample(a):
    smp_init = np.random.choice(inits, len(inits))
    smp_a = a.sel(init=smp_init)
    smp_a['init'] = inits
    return smp_a


# serial function
def bootstrap_func(bootstrap=bootstrap):
    res = (func(resample(a),b) for _ in range(bootstrap))
    res = xr.concat(res,'bootstrap')
    # leave out quantile because not issue here yet
    #res_ci = res.quantile([.05,.95],'bootstrap')
    return res


@dask.delayed
def bootstrap_func_delayed_decorator(bootstrap=bootstrap):
    return bootstrap_func(bootstrap=bootstrap)


def bootstrap_func_delayed(bootstrap=bootstrap):
    res = (dask.delayed(func)(resample(a),b) for _ in range(bootstrap))
    res = xr.concat(dask.compute(*res),'bootstrap')
    #res_ci = res.quantile([.05,.95],'bootstrap')
    return res

for scheduler in ['synchronous','distributed','multiprocessing','processes','single-threaded','threads']:
    print('scheduler:',scheduler)

    def bootstrap_func_delayed_processes(bootstrap=bootstrap):
        res = (dask.delayed(func)(resample(a),b) for _ in range(bootstrap))
        res = xr.concat(dask.compute(*res, scheduler=scheduler),'bootstrap')
        res = res.quantile([.05,.95],'bootstrap')
        return res

    %time c = bootstrap_func_delayed_processes()

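The following results are from my 4-core laptop. On the supercomputer I don't see a speedup either, but rather a slowdown of about 50%.

Serial results: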

%time c = bootstrap_func()
CPU times: user 814 ms, sys: 58.7 ms, total: 872 ms
Wall time: 862 ms

Parallel results:

%time c = bootstrap_func_delayed_decorator().compute()
CPU times: user 96.2 ms, sys: 50 ms, total: 146 ms
Wall time: 906 ms

Results from the loop over schedulers:

scheduler: synchronous
CPU times: user 2.57 s, sys: 330 ms, total: 2.9 s
Wall time: 2.95 s
scheduler: distributed
CPU times: user 4.51 s, sys: 2.74 s, total: 7.25 s
Wall time: 8.86 s
scheduler: multiprocessing
CPU times: user 4.18 s, sys: 2.53 s, total: 6.71 s
Wall time: 7.95 s
scheduler: processes
CPU times: user 3.97 s, sys: 2.1 s, total: 6.07 s
Wall time: 7.39 s
scheduler: single-threaded
CPU times: user 2.26 s, sys: 275 ms, total: 2.54 s
Wall time: 2.47 s
scheduler: threads
CPU times: user 2.84 s, sys: 341 ms, total: 3.18 s
Wall time: 2.66 s
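
To make the timings above easier to compare, here is a minimal sketch that normalizes each wall time against the single-threaded run and reports the average wall time per task. The numbers are copied from the output above; the helper name `slowdown_vs` is hypothetical, and the per-task figure includes the concat and quantile steps, so it is only a rough upper bound on the cost of one resample.

```python
# Wall times in seconds, copied from the scheduler loop output above.
wall_times = {
    "synchronous": 2.95,
    "distributed": 8.86,
    "multiprocessing": 7.95,
    "processes": 7.39,
    "single-threaded": 2.47,
    "threads": 2.66,
}

n_tasks = 96  # value of `bootstrap` in the MWE


def slowdown_vs(baseline, times):
    """Return each scheduler's wall time divided by the baseline's wall time."""
    ref = times[baseline]
    return {name: t / ref for name, t in times.items()}


ratios = slowdown_vs("single-threaded", wall_times)
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    per_task_ms = 1000 * wall_times[name] / n_tasks
    print(f"{name:16s} {ratio:4.2f}x  ({per_task_ms:5.1f} ms per task)")
```

In these numbers no scheduler beats the single-threaded run, and the process-based ones are roughly three times slower. That would be consistent with per-task overhead (for example moving arrays between processes) outweighing tasks that each take only a few tens of milliseconds, although the timings alone do not prove the cause.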

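Expected results:
- a speedup of about 0.8 * ncores

Further notes:
- I also checked whether I should chunk the data. That seems to produce too many chunks; the chunked arrays take longer.

My questions:
- What am I misunderstanding about dask parallelization?
- Is my client setup not useful?
- Did I implement dask.delayed in a way that is not smart enough?
- Is my serial function already executed in parallel by dask? I think not.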

Answer 1

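I finally solved this problem. When I posted this question, I clearly did not understand several aspects of it:

- I ran the timings on a laptop with two physical cores. That does not allow much parallelization for a CPU-bound problem. I have now run it on a node with 48 logical CPUs.
- I should have thought about which parts of the algorithm are easily parallelizable and which are not. Only then can I chunk accordingly.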

See my solution here: https://gist.github.com/aaronspring/118abd7b9bf81e555b1fced42eef427f

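The game changers compared with the originally posted code:

- I chunk along a dimension (here x) that is not involved in the function (the function operates along time).
- I still use the client as above: Best practices in setting number of dask workers
- I only try to parallelize the iteration part. The quantile method is done in memory.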
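
The first point can be illustrated without dask. The following is a minimal sketch, not the code from the gist: it uses plain NumPy and a standard-library thread pool instead of dask chunks, and it maps the idea onto the question's MWE, where the function reduces over `init` (axis 0) and `lon` (axis 2) is a dimension the function does not touch. The helper names `bootstrap_block` and `bootstrap_blocked` and the reduced array sizes are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

rng = np.random.default_rng(0)
n_init, n_lat, n_lon, n_boot = 50, 24, 48, 20  # smaller than the MWE
a = rng.random((n_init, n_lat, n_lon))
b = rng.random((n_init, n_lat, n_lon))

# Draw all resampling indices once so every block sees the same resamples.
idx = rng.integers(0, n_init, size=(n_boot, n_init))


def bootstrap_block(a_blk, b_blk):
    """Bootstrap the std of (a - b) over 'init' (axis 0) for one spatial block."""
    return np.stack([(a_blk[i] - b_blk).std(axis=0) for i in idx])


def bootstrap_blocked(a, b, n_blocks=4):
    """Split along lon (axis 2), which the reduction never touches."""
    a_blocks = np.array_split(a, n_blocks, axis=2)
    b_blocks = np.array_split(b, n_blocks, axis=2)
    with ThreadPoolExecutor(max_workers=n_blocks) as pool:
        parts = list(pool.map(bootstrap_block, a_blocks, b_blocks))
    # Blocks are independent, so stitching them back along lon is exact.
    return np.concatenate(parts, axis=2)


res = bootstrap_blocked(a, b)
print(res.shape)  # (20, 24, 48)
```

Because the reduction runs only along `init`, each block along `lon` can be processed on its own and the pieces concatenated afterwards without changing the result. Whether threads actually speed this up depends on how much of the work NumPy does outside the GIL, so the sketch shows the decomposition rather than a guaranteed speedup.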

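Conclusion: it is simpler than expected. The gist shows an implementation using dask.delayed and dask.futures, but in my use case they are not even needed. First try to understand parallelism (https://realpython.com/python-concurrency/) and read the dask documentation (https://dask.org/).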

Answer 2

A faster solution using multidimensional indexing:

https://xskillscore.readthedocs.io/en/latest/api/xskillscore.core.resampling.resample_iterations_idx.html#xskillscore.core.resampling.resample_iterations_idx
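
As a rough illustration of the idea behind multidimensional indexing, and not the xskillscore implementation itself, the sketch below draws all resampling indices at once as a 2-D array and lets NumPy fancy indexing build every resample in a single step. It uses plain NumPy rather than xarray, and the array sizes are reduced assumptions so that the intermediate array stays small.

```python
import numpy as np

rng = np.random.default_rng(42)
n_init, n_lat, n_lon, n_boot = 50, 8, 16, 20  # smaller than the MWE
a = rng.random((n_init, n_lat, n_lon))
b = rng.random((n_init, n_lat, n_lon))

# One row of indices per bootstrap iteration, drawn with replacement.
idx = rng.integers(0, n_init, size=(n_boot, n_init))

# Indexing axis 0 with a 2-D integer array yields shape
# (n_boot, n_init, n_lat, n_lon): all resamples of `a` at once.
resampled = a[idx]

# Same statistic as func(a, b) in the question: std of (a - b) over init.
vectorized = (resampled - b[np.newaxis]).std(axis=1)

# Reference: the explicit loop over iterations with the same indices.
looped = np.stack([(a[i] - b).std(axis=0) for i in idx])
```

The vectorized form trades memory for speed: the intermediate `resampled` array is `n_boot` times larger than `a`, so for large grids or many iterations it may need to be processed in batches.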

Parallel bootstrapping with replacement with xarray / dask
