I am trying to normalize a dask array by its own reduction (e.g. b = a / a.sum(), where a and b are dask arrays). Computing this normalized array will first compute everything needed to know the original array and only then compute the division, and therefore spill to disk if the array does not fit into memory.
Code example:
from dask.distributed import Client
from dask import array as da
# Create a 1000 MB array full of 1's, with 50 MB chunks
a = da.ones(shape=(1/8 * 1000e6, 1), chunks=(1/8 * 50e6, 1))
# Create normalized array with sum = 1
b = a / a.sum()
# Create a cluster too small to hold all of a or b at once
client = Client(n_workers=1, threads_per_worker=1, memory_limit=500e6)
# Compute sum of b (Spills to disk)
print(b.sum().compute())
Is there something like the following?
b = a / same_as_a_but_different_tasks.sum()
Answer (score: 0):
I worked around this by copying the array and renaming all tasks of the uppermost layer:
from copy import deepcopy
def copy_with_renamed_top_layer(a, prepend_name="copy-of-"):
    # copy array and dask graph
    b = a.copy()
    b.dask = deepcopy(b.dask)
    # get new name
    orig_name = a.name
    new_name = prepend_name + orig_name
    # rename dependencies
    b.dask.dependencies[new_name] = b.dask.dependencies.pop(orig_name)
    # rename tasks of uppermost layer
    b.dask.layers[new_name] = b.dask.layers.pop(orig_name)
    b.dask.layers[new_name] = {
        (new_name,) + k[1:]: v
        for k, v in b.dask.layers[new_name].items()
    }
    # rename array
    b.name = new_name
    return b
# Create a 1000 MB array full of 1's, with 50 MB chunks
a = da.ones(shape=(1/8 * 1000e6, 1), chunks=(1/8 * 50e6, 1))
# copy and rename uppermost layer
a_copy = copy_with_renamed_top_layer(a)
# Create normalized array with sum = 1
b = a / a_copy.sum()
However, this is a very brittle solution, since it relies on the current internal API.
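If you can depend on a newer dask release, the graph-manipulation helpers may give the same effect through public API: dask.graph_manipulation.clone produces an equivalent collection whose tasks have new keys. A minimal sketch, assuming that function is available in your dask version:
# Sketch, assuming dask.graph_manipulation.clone exists (public API in
# newer dask releases). clone(a) returns an array with the same chunks
# but different task keys, so summing the clone does not force a's
# chunks to be kept around for the later division.
from dask import array as da
from dask.graph_manipulation import clone
a = da.ones(shape=(1/8 * 1000e6, 1), chunks=(1/8 * 50e6, 1))
a_clone = clone(a)
b = a / a_clone.sum()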