I am trying to restructure my code to use Dask instead of NumPy for large array computations. However, I am struggling with Dask's runtime performance:
In[15]: import numpy as np
In[16]: import dask.array as da
In[17]: np_arr = np.random.rand(10, 10000, 10000)
In[18]: da_arr = da.from_array(np_arr, chunks=(-1, 'auto', 'auto'))
In[19]: %timeit np.mean(np_arr, axis=0)
1 loop, best of 3: 2.59 s per loop
In[20]: %timeit da_arr.mean(axis=0).compute()
1 loop, best of 3: 4.23 s per loop
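I have looked at a similar question (why is dot product in dask slower than in numpy), but experimenting with chunk sizes did not help. I will mostly be working with arrays of roughly the size shown above. Is it advisable to use NumPy rather than Dask for arrays like this, or is there something I can tune? I also tried using Client from dask.distributed, starting it with 16 processes and 4 threads per process (16-core CPU), but that made things worse.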
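One variation that may be worth timing is sketched below. It assumes the data can be created or loaded chunk by chunk instead of starting from an in-memory NumPy array, so nothing large has to be split up by da.from_array. The function name mean_over_first_axis and the chunk sizes are illustrative only, not tuned values.

```python
import dask.array as da


def mean_over_first_axis(shape, chunks):
    """Build a random Dask array lazily and reduce it along axis 0.

    Each chunk is generated by its own task, so the full array never
    has to exist as a single NumPy array in the calling process.
    """
    arr = da.random.random(shape, chunks=chunks)
    # The first axis is kept in one chunk, so the mean along axis 0
    # can be computed independently for every block.
    return arr.mean(axis=0).compute()


if __name__ == "__main__":
    # Smaller than the array in the question so it runs quickly.
    result = mean_over_first_axis((10, 1000, 1000), chunks=(10, 250, 250))
    print(result.shape)
```

Whether this is faster than plain NumPy for arrays that already fit in memory is exactly what I am unsure about, so the timings would need to be compared on the real workload.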
Thanks in advance!
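Edit:
I have experimented with Dask and distributed processing. Data transfer (scattering the array and retrieving the result) appears to be the main bottleneck, while the computation itself is very fast (436 ms versus 9.51 s of CPU time). However, even for client.compute(), the wall time (12.1 s) is longer than that of do_stuff(data). Overall, is there a way to improve the data transfer?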
In[3]: import numpy as np
In[4]: from dask.distributed import Client, wait
In[5]: from dask import delayed
In[6]: import dask.array as da
In[7]: client = Client('address:port')
In[8]: client
Out[8]: <Client: scheduler='tcp://address:port' processes=4 cores=16>
In[9]: data = np.random.rand(400, 100, 10000)
In[10]: %time [future] = client.scatter([data])
CPU times: user 8.36 s, sys: 5.08 s, total: 13.4 s
Wall time: 24.5 s
In[11]: x = da.from_delayed(delayed(future), shape=data.shape, dtype=data.dtype)
In[12]: x = x.rechunk(chunks=('auto', 'auto', 'auto'))
In[13]: x = client.persist(x)
In[14]: {w: len(keys) for w, keys in client.has_what().items()}
Out[14]:
{'tcp://address:port': 65,
'tcp://address:port': 0,
'tcp://address:port': 0,
'tcp://address:port': 0}
In[15]: client.rebalance(x)
In[16]: {w: len(keys) for w, keys in client.has_what().items()}
Out[16]:
{'tcp://address:port': 17,
'tcp://address:port': 16,
'tcp://address:port': 16,
'tcp://address:port': 16}
In[17]: def do_stuff(arr):
...     arr = arr/3. + arr**2 - arr**(1/2)
...     arr[arr >= 0.5] = 1
...     return arr
...
In[18]: %time future_compute = client.compute(do_stuff(x)); wait(future_compute)
Matplotlib support failed
CPU times: user 387 ms, sys: 49.5 ms, total: 436 ms
Wall time: 12.1 s
In[19]: future_compute
Out[19]: <Future: status: finished, type: ndarray, key: finalize-54eb04bbe03eee8af686fd43b41eb161>
In[21]: %timeit future_compute.result()
1 loop, best of 3: 19.4 s per loop
In[21]: %time do_stuff(data)
CPU times: user 4.49 s, sys: 5.02 s, total: 9.51 s
Wall time: 9.5 s
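To put the transfer times in perspective, here is a rough back-of-the-envelope sketch based only on the numbers above. It assumes float64 elements (8 bytes each, as produced by np.random.rand) and treats the measured wall times as pure transfer time, which overstates the network cost because serialization is included. The helper names array_nbytes and throughput_mb_per_s are made up for this illustration.

```python
def array_nbytes(shape, itemsize=8):
    """Size in bytes of a dense array with the given shape (float64 by default)."""
    n = 1
    for dim in shape:
        n *= dim
    return n * itemsize


def throughput_mb_per_s(nbytes, seconds):
    """Effective throughput in MB/s (1 MB = 10**6 bytes)."""
    return nbytes / seconds / 1e6


# Shape of `data` in the session above.
nbytes = array_nbytes((400, 100, 10000))

# Wall times taken from the session: client.scatter and future_compute.result().
scatter_rate = throughput_mb_per_s(nbytes, 24.5)
result_rate = throughput_mb_per_s(nbytes, 19.4)

print(f"array size: {nbytes / 1e9:.1f} GB")
print(f"scatter:    {scatter_rate:.0f} MB/s")
print(f"result():   {result_rate:.0f} MB/s")
```

So the array is about 3.2 GB, and both directions move it at an effective rate of very roughly 130 to 165 MB/s, which is why the transfer dominates the 436 ms spent in client.compute().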