I am trying to store a dask array in a zarr file.
I managed to do it when the dask array has a defined shape.
import dask
import dask.array as da
import numpy as np
from tempfile import TemporaryDirectory
import zarr
np_array = np.random.randint(1, 10, size=1000)
array = da.from_array(np_array)
with TemporaryDirectory() as tmpdir:
    delayed = da.to_zarr(array, url=tmpdir,
                         compute=False, component='/data')
    dask.compute(delayed)
    z_object = zarr.open_group(tmpdir, mode='r')
    assert np.all(np_array == z_object.data[:])
However, if I perform any operation on the dask array, the shape is lost, and it complains about NaNs in the shape.
# this will fail
np_array = np.random.randint(1, 10, size=1000)
array = da.from_array(np_array)
array = array[array > 5]
with TemporaryDirectory() as tmpdir:
    delayed = da.to_zarr(array, url=tmpdir,
                         compute=False, component='/data')
    dask.compute(delayed)
    z_object = zarr.open_group(tmpdir, mode='r')
    assert np.all(np_array[np_array > 5] == z_object.data[:])
This is the error that is raised:
Traceback (most recent call last):
  File "/home/peio/devel/variation/variation6/variation6/tests/test_zarr.py", line 38, in <module>
    without_shape()
  File "/home/peio/devel/variation/variation6/variation6/tests/test_zarr.py", line 29, in without_shape
    compute=False, component='/data')
  File "/home/peio/devel/variation/pyenv3/lib/python3.7/site-packages/dask/array/core.py", line 2808, in to_zarr
    **kwargs
  File "/home/peio/devel/variation/pyenv3/lib/python3.7/site-packages/zarr/creation.py", line 120, in create
    chunk_store=chunk_store, filters=filters, object_codec=object_codec)
  File "/home/peio/devel/variation/pyenv3/lib/python3.7/site-packages/zarr/storage.py", line 323, in init_array
    object_codec=object_codec)
  File "/home/peio/devel/variation/pyenv3/lib/python3.7/site-packages/zarr/storage.py", line 343, in _init_array_metadata
    shape = normalize_shape(shape) + dtype.shape
  File "/home/peio/devel/variation/pyenv3/lib/python3.7/site-packages/zarr/util.py", line 58, in normalize_shape
    shape = tuple(int(s) for s in shape)
  File "/home/peio/devel/variation/pyenv3/lib/python3.7/site-packages/zarr/util.py", line 58, in <genexpr>
    shape = tuple(int(s) for s in shape)
ValueError: cannot convert float NaN to integer
Is there a way to store a dask array with an unknown shape in a zarr file?
Thanks!
Answer 0 (score: 0)
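Zarr expects chunk shapes to be uniform and known in advance. Dask currently facilitates this by rechunking arrays so that they are uniform. However, array[array > 5] creates a Dask array with unknown chunk shapes, because the number of elements that pass the filter depends on the data itself. So it isn't possible to rechunk these to be uniform in advance, as the needed information doesn't exist. That said, we could explain this better.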
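To make the "unknown chunk shapes" point concrete, here is a minimal sketch that inspects the array before and after the boolean filter. The chunks=250 argument and the variable names are assumptions chosen for illustration; they are not in the question.

```python
import math

import dask.array as da
import numpy as np

np_array = np.random.randint(1, 10, size=1000)
# chunks=250 is an illustrative choice so that there are several chunks.
array = da.from_array(np_array, chunks=250)

# Before filtering, the shape and every chunk size are known.
known_shape = array.shape      # (1000,)
known_chunks = array.chunks    # ((250, 250, 250, 250),)

# Boolean indexing depends on the data, so Dask cannot know the result
# sizes without computing; it records NaN placeholders instead.
filtered = array[array > 5]
unknown_shape = filtered.shape
unknown_chunks = filtered.chunks

print(known_shape, known_chunks)
print(unknown_shape, unknown_chunks)
print(all(math.isnan(size) for size in unknown_chunks[0]))
```

Those NaN entries are what Zarr later tries to convert with int(), which is where the ValueError in the traceback comes from.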
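It's possible to work around this by using Dask operations that return known chunk shapes (as David suggested). Alternatively, it is possible to determine the chunk shapes before storing (at the cost of computing). We may also discuss extending Zarr to handle this case, but that is a longer-term solution.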