Suppose I generate an array whose shape depends on some computation, for example:
>>> import dask.array as da
>>> a = da.random.normal(size=(int(1e6), 10))
>>> a = a[a.mean(axis=1) > 0]
>>> a.shape
(nan, 10)
>>> a.chunks
((nan, nan, nan, nan, nan), (10,))
>>> a.chunksize
(nan, 10)
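The nan values are expected. When I persist the result of the computation on the dask workers, I would have assumed that this missing metadata could be retrieved, but apparently that is not the case: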
>>> a_persisted = a.persist()
>>> a_persisted.chunks
((nan, nan, nan, nan, nan), (10,))
>>> a_persisted.chunksize
(nan, 10)
>>> a_persisted.shape
(nan, 10)
If I try to force a rechunk, I get:
>>> a_persisted.rechunk("auto")
Traceback (most recent call last):
  File "<ipython-input-26-31162de022a0>", line 1, in <module>
    a_persisted.rechunk("auto")
  File "/home/ogrisel/code/dask/dask/array/core.py", line 1647, in rechunk
    return rechunk(self, chunks, threshold, block_size_limit)
  File "/home/ogrisel/code/dask/dask/array/rechunk.py", line 226, in rechunk
    dtype=x.dtype, previous_chunks=x.chunks)
  File "/home/ogrisel/code/dask/dask/array/core.py", line 1872, in normalize_chunks
    chunks = auto_chunks(chunks, shape, limit, dtype, previous_chunks)
  File "/home/ogrisel/code/dask/dask/array/core.py", line 1949, in auto_chunks
    raise ValueError("Can not perform automatic rechunking with unknown "
ValueError: Can not perform automatic rechunking with unknown (nan) chunk sizes
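What is the idiomatic way to update the array's metadata with the actual sizes of the chunks that have already been computed on the workers?
I can compute them very cheaply with:
>>> import dask
>>> dask.compute([chunk.shape for chunk in a_persisted.to_delayed().ravel()])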
([(100108, 10), (99944, 10), (99545, 10), (99826, 10), (100099, 10)],)
My question is how to get a new dask array with the same chunks but with informative .shape, .chunks and .chunksize attributes (with no nan).
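To make the target concrete, here is a minimal sketch in plain Python of how the per-chunk shapes computed above map onto the chunks and shape metadata. The helper name `chunks_from_shapes` is hypothetical (it is not part of dask), and it assumes a 2-D array that is split only along the first axis, as in this example.

```python
def chunks_from_shapes(chunk_shapes):
    """Derive a dask-style ``chunks`` tuple and the total shape.

    Hypothetical helper, not a dask function. Assumes a 2-D array that is
    chunked only along axis 0, so every block has the same column count.
    """
    row_counts = tuple(rows for rows, _ in chunk_shapes)
    n_cols = chunk_shapes[0][1]
    chunks = (row_counts, (n_cols,))      # one tuple of block sizes per axis
    shape = (sum(row_counts), n_cols)     # total rows is the sum of block rows
    return chunks, shape


# Per-chunk shapes as returned by the dask.compute(...) call above.
chunk_shapes = [(100108, 10), (99944, 10), (99545, 10), (99826, 10), (100099, 10)]
chunks, shape = chunks_from_shapes(chunk_shapes)
print(chunks)  # ((100108, 99944, 99545, 99826, 100099), (10,))
print(shape)   # (499522, 10)
```

This only computes the metadata; attaching it to the persisted array is the part the question asks about.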
>>> dask.__version__
'1.1.0+9.gb1fef05'
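Answer 0 (score: 1)
There is no good solution for this today, but there could be. I suggest raising an issue if one does not already exist. This is a commonly requested feature.
Edit: this is being tracked here: https://github.com/dask/dask/issues/3293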
Answer 1 (score: 1)
It looks like this will soon be solved internally in dask array (https://github.com/dask/dask/issues/3293). Until then, here is the workaround I use:
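import numpy as np
import dask.array as da
import dask.dataframe as dd
a = da.random.normal(size=(int(1e6), 10))
filtered = a[a.mean(axis=1) > 0]
# Round-trip through a dask dataframe; lengths=True computes the chunk lengths.
df = dd.from_dask_array(filtered, columns=np.arange(a.shape[1]))
a = df.to_dask_array(lengths=True).persist()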
print(a.chunks)
print(a.shape)
((100068, 100157, 100279, 100446, 99706), (10,))
(500656, 10)
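For readers on a more recent dask than the 1.1.0 used in the question, dask arrays also provide a `compute_chunk_sizes()` method that fills in unknown chunk sizes in place. The sketch below assumes such a release is installed and uses a deliberately small array (the sizes and chunking are illustrative, not taken from the question) so that it runs quickly.

```python
import dask.array as da

# Small illustrative array: 1000 rows split into 5 blocks of 200 rows.
x = da.random.normal(size=(1000, 10), chunks=(200, 10))

# Boolean indexing makes the row counts unknown (nan) until computed.
y = x[x.mean(axis=1) > 0]

# Keep the filtered blocks in memory, then resolve their sizes.
y = y.persist()
y.compute_chunk_sizes()  # updates y.chunks (and hence y.shape) in place

print(y.chunks)
print(y.shape)

# With known chunk sizes, automatic rechunking no longer raises ValueError.
z = y.rechunk("auto")
```

Persisting first means the size computation reads the blocks already held in memory instead of re-running the filter.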