Question

Introduction

I have an image stack (ImgStack) made of 42 planes, each 2048x2048 pixels, and I run this function on it for the analysis:

def All(ImgStack):
    # some filtering
    # more filtering
    return ImgStack  # placeholder: the real function returns the filtered block

I determined that the most efficient way to process the array with dask (on my computer) is to use chunks=(21,256,256).

When I run map_blocks:

import time
import numpy as np
import dask.array as da

now=time.time()
z=da.from_array(ImgStack,chunks=(21,256,256))
g=da.ghost.ghost(z, depth={0:10, 1:50,2:50},boundary={0: 'periodic',1:'periodic',2:'periodic'})
g2=g.map_blocks(All)
result = da.ghost.trim_internal(g2, {0: 10, 1: 50,2:50})
print('Time=',str(time.time()-now))

Time= 1.7090258598327637

whereas when I run map_overlap:

now=time.time()
z=da.from_array(ImgStack,chunks=(21,256,256))
y=z.map_overlap(All,depth={0:10, 1:50,2:50},boundary={0: 'periodic', 1: 'periodic',2:'periodic'})
y.compute()
print('Time=',str(time.time()-now))

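Time= 228.19104409217834

I guess the large time difference is due to the conversion from dask.array to np.array inside map_overlap, because if I add the conversion step to the map_blocks script the execution times become comparable: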

now=time.time()
z=da.from_array(ImgStack,chunks=(21,256,256))
g=da.ghost.ghost(z, depth={0:10, 1:50,2:50},boundary={0: 'periodic', 1: 'periodic',2:'periodic'})
g2=g.map_blocks(All)
result = da.ghost.trim_internal(g2, {0: 10, 1: 50,2:50})
I=np.array(result)
print('Time=',str(time.time()-now))

Time= 209.68917989730835

Problem

So the best approach would be to keep the dask.array, but the problem shows up when I save the data to an h5 file:

now=time.time()
result.to_hdf5('/Users/simone/Downloads/test.h5','/Dask2',compression='lzf')
print('Time=',str(time.time()-now))

Time= 243.1597340106964

But if I save the corresponding np.array:

test=h5.File('/Users/simone/Downloads/test.h5','r+')
DT=test.require_group('NP')
DT.create_dataset('t', data=I,dtype=I.dtype,compression="lzf")
# note: the timer starts only after create_dataset, so the write itself is not measured
now=time.time()
print('Time=',str(time.time()-now))

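Time= 4.887580871582031e-05

Question

So I would like to run the filtering and save the array in the shortest possible time. Is there a way to speed up the dask.array -> np.array() conversion, or to speed up da.to_hdf5?

Thanks! Any comment will be appreciated.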

Answer 1

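In your fast example you never actually call compute on the result. That one second is just the time spent setting up the computation graph. To me it looks like your computation really does take around 200 seconds.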
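To make the graph-versus-compute distinction concrete, here is a minimal sketch on a small random stack. `toy_filter` is a hypothetical stand-in for your `All` function, and the array size, chunks and depths are made up so it runs quickly. It uses the `map_overlap` method of a dask array rather than the older `da.ghost` calls.

```python
import time

import numpy as np
import dask.array as da


def toy_filter(block):
    # Hypothetical stand-in for the real filtering function `All`.
    return block + 1.0


# Small made-up stack so the example finishes quickly.
stack = np.random.default_rng(0).random((4, 64, 64))
z = da.from_array(stack, chunks=(2, 32, 32))

t0 = time.time()
lazy = z.map_overlap(toy_filter, depth={0: 1, 1: 4, 2: 4}, boundary="periodic")
graph_time = time.time() - t0  # only the task graph has been built so far

t0 = time.time()
out = lazy.compute()  # the filtering actually runs here
compute_time = time.time() - t0

print(type(lazy).__name__, type(out).__name__)
print("graph:", graph_time, "compute:", compute_time)
```

Timing only the first part is what the map_blocks snippet in the question does, so it should not be compared with a timing that includes `compute()`, `np.array(...)` or `to_hdf5`.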

If you want a better idea of what is taking the time, you can try the dask profiler.
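As a rough sketch of what that could look like, assuming the local threaded scheduler and the `Profiler` from `dask.diagnostics`: the filter, array size and chunking below are again made up, and the grouping by task name is just one way to summarize the records.

```python
from collections import defaultdict

import numpy as np
import dask.array as da
from dask.diagnostics import Profiler


def toy_filter(block):
    # Hypothetical stand-in for the real filtering function `All`.
    return block + 1.0


stack = np.random.default_rng(0).random((4, 64, 64))
z = da.from_array(stack, chunks=(2, 32, 32))
lazy = z.map_overlap(toy_filter, depth={0: 1, 1: 4, 2: 4}, boundary="periodic")

# Record one entry per executed task while the graph is computed.
with Profiler() as prof:
    lazy.compute(scheduler="threads")

# Sum the wall time per task name (first element of each array key).
seconds_by_task = defaultdict(float)
for rec in prof.results:
    name = rec.key[0] if isinstance(rec.key, tuple) else rec.key
    seconds_by_task[str(name)] += rec.end_time - rec.start_time

for name, secs in sorted(seconds_by_task.items(), key=lambda kv: -kv[1]):
    print(f"{secs:.4f}s  {name}")
```

If bokeh is installed, `prof.visualize()` should give a timeline of the same records, which can be easier to read than the printed totals.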

