Best way to parallelize computation over dask blocks that do not return np arrays?

Time: 2020-03-23 18:48:19

Tags: dask dask-delayed

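I would like to return a dask dataframe from an overlapping dask array computation, where each block's computation returns a pandas dataframe. The example below shows one way to do this, simplified for demonstration purposes. I have found that a combination of da.overlap.overlap and to_delayed().ravel() gets the job done, provided I pass in the relevant block key and chunk information.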
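As a minimal sketch of that idea on a small 2-D array (the array shape, chunk size and depth here are illustrative assumptions, not part of the real data): each delayed block carries a key of the form (name, i, j, ...), and multiplying the block index by the chunk size gives the global position of the block's first non-overlap element. The key layout is an implementation detail of dask, so treat this as an observation rather than a guaranteed interface.

```python
import dask.array as da
import numpy as np

# Illustrative 16x16 array split into four 8x8 chunks (assumed sizes).
arr = da.from_array(np.zeros((16, 16)), chunks=(8, 8))
overlapped = da.overlap.overlap(arr, depth=(2, 2), boundary="reflect")

block_info = []
for block in overlapped.to_delayed().ravel():
    # block.key looks like (name, i, j); drop the name to get the block index.
    block_index = block.key[1:]
    # Offset of the block in the original (non-overlapping) array.
    offsets = np.array(block_index) * np.array(arr.chunksize)
    block_info.append((tuple(block_index), tuple(int(o) for o in offsets)))

for index, offsets in block_info:
    print(index, offsets)
```

The multiplication by arr.chunksize assumes all chunks along an axis have the same size.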

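Edit: Thanks to @AnnaM, who caught bugs in the original post and then made the code general! Based on her comments, I am including an updated version of the code. Also, in response to Anna's interest in memory usage, I verified that this does not seem to use more memory than naively expected.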

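import dask
import dask.array as da
import dask.dataframe as dd
import numpy as np
import pandas as pd

def extract_features_generalized(chunk, offsets, depth, columns):
    shape = np.asarray(chunk.shape)
    offsets = np.asarray(offsets)
    depth = np.asarray(depth)
    # Chunk-local coordinates of the nonzero entries, one row per feature
    coordinates = np.stack(np.nonzero(chunk)).T
    # Keep only features that lie outside the overlap margin of this chunk
    keep = ((coordinates >= depth) & (coordinates < (shape - depth))).all(axis=1)
    # Convert chunk-local coordinates to global coordinates
    data = coordinates + offsets - depth
    df = pd.DataFrame(data=data, columns=columns)
    return df[keep]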

def my_overlap_generalized(data, chunksize, depth, columns, boundary):
    data = data.rechunk(chunksize)
    data_overlapping_chunks = da.overlap.overlap(data, depth=depth, boundary=boundary)

    dfs = []
    for block in data_overlapping_chunks.to_delayed().ravel():
        offsets = np.array(block.key[1:]) * np.array(data.chunksize)
        df_block = dask.delayed(extract_features_generalized)(block, offsets=offsets, 
                                                              depth=depth, columns=columns)
        dfs.append(df_block)

    return dd.from_delayed(dfs)

data = np.zeros((2,4,8,16,16))  
data[0,0,4,2,2] = 1
data[0,1,4,6,2] = 1
data[1,2,4,8,2] = 1
data[0,3,4,2,2] = 1

arr = da.from_array(data)
df = my_overlap_generalized(arr, 
                            chunksize=(-1,-1,-1,8,8), 
                            depth=(0,0,0,2,2), 
                            columns=['r', 'c', 'z', 'y', 'x'],
                            boundary=tuple(['reflect']*5))
df.compute().reset_index()

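-Rest of the original post, including the original bugs-

My example only overlaps xy, but it is easy to generalize. Is there anything below that is suboptimal or could be done better? Is anything likely to break because it relies on low-level information that could change (e.g. the block key)?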

def my_overlap(data, chunk_xy, depth_xy):
    data = data.rechunk((-1,-1,-1, chunk_xy, chunk_xy))
    data_overlapping_chunks = da.overlap.overlap(data, 
                                                 depth=(0,0,0,depth_xy,depth_xy), 
                                                 boundary={3: 'reflect', 4: 'reflect'})

    dfs = []
    for block in data_overlapping_chunks.to_delayed().ravel():
        offsets = np.array(block.key[1:]) * np.array(data.chunksize)
        df_block = dask.delayed(extract_features)(block, offsets=offsets, depth_xy=depth_xy)
        dfs.append(df_block)

    # All computation is delayed, so downstream computations need to know the format of the data.  If the meta
    # information is not specified, a single computation will be done (which could be expensive) at this point
    # to infer the metadata.
    # This empty dataframe has the index, column, and type information we expect in the computation.
    columns = ['r', 'c', 'z', 'y', 'x']

    # The dtypes are float64, except for a small number of columns
    df_meta = pd.DataFrame(columns=columns, dtype=np.float64)
    df_meta = df_meta.astype({'c': np.int64, 'r': np.int64})
    df_meta.index.name = 'feature'

    return dd.from_delayed(dfs, meta=df_meta)

def extract_features(chunk, offsets, depth_xy):
    r, c, z, y, x = np.nonzero(chunk) 
    df = pd.DataFrame({'r': r, 'c': c, 'z': z, 'y': y+offsets[3]-depth_xy, 
                       'x': x+offsets[4]-depth_xy})
    df = df[(df.y > depth_xy) & (df.y < (chunk.shape[3] - depth_xy)) &
            (df.z > depth_xy) & (df.z < (chunk.shape[4] - depth_xy))]
    return df

data = np.zeros((2,4,8,16,16))  # round, channel, z, y, x
data[0,0,4,2,2] = 1
data[0,1,4,6,2] = 1
data[1,2,4,8,2] = 1
data[0,3,4,2,2] = 1
arr = da.from_array(data)
df = my_overlap(arr, chunk_xy=8, depth_xy=2)
df.compute().reset_index()
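On the concern about relying on block keys: one possible alternative, sketched here as an assumption rather than something taken from the post, is to derive each block's offset from the array's public chunks attribute (a tuple with one tuple of chunk sizes per axis). block_offsets is a hypothetical helper name. The sketch uses only NumPy and the standard library, and it iterates block indices in C order, which should correspond to the order of to_delayed().ravel().

```python
import itertools

import numpy as np

def block_offsets(chunks):
    """Map each block index to the global coordinate of the block's first element.

    `chunks` has the layout of a dask array's `.chunks` attribute,
    e.g. ((8, 8), (8, 8)) for a 16x16 array in 8x8 blocks.
    """
    # Start position of every chunk along each axis: 0, then cumulative sizes.
    starts_per_axis = [
        np.concatenate(([0], np.cumsum(axis_chunks)[:-1])) for axis_chunks in chunks
    ]
    offsets = {}
    for index in itertools.product(*(range(len(c)) for c in chunks)):
        offsets[index] = tuple(
            int(starts[i]) for starts, i in zip(starts_per_axis, index)
        )
    return offsets

print(block_offsets(((8, 8), (8, 8))))
```

Because the offsets come from cumulative chunk sizes, this also covers chunks of unequal size along an axis, where multiplying the block index by data.chunksize would not give the right position.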

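1 Answer:

Answer 0: (score: 1)

First of all, thanks for posting your code. I am working on a similar problem and this was really helpful to me.

When testing your code, I found a few mistakes in the extract_features function that prevent your code from returning the correct indices.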

Here is a corrected version:

def extract_features(chunk, offsets, depth_xy):
    r, c, z, y, x = np.nonzero(chunk)
    df = pd.DataFrame({'r': r, 'c': c, 'z': z, 'y': y, 'x': x})
    df = df[(df.y >= depth_xy) & (df.y < (chunk.shape[3] - depth_xy)) &
            (df.x >= depth_xy) & (df.x < (chunk.shape[4] - depth_xy))]
    df['y'] = df['y'] + offsets[3] - depth_xy
    df['x'] = df['x'] + offsets[4] - depth_xy
    return df

The updated code now returns the indices that were set to 1:

   index  r  c  z  y  x
0      0  0  0  4  2  2
1      1  0  1  4  6  2
2      2  0  3  4  2  2
3      1  1  2  4  8  2

For comparison, this is the output of the original version:

   index  r  c  z  y  x
0      1  0  1  4  6  2
1      3  1  2  4  8  2
2      0  0  1  4  6  2
3      1  1  2  4  8  2

It returns the 2nd and 4th rows, twice each.

The reason this happens is three mistakes in the extract_features function:

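  1. You first add the offset and subtract the depth, and only then filter out the overlapping parts: the order needs to be swapped
  2. df.y > depth_xy should be replaced with df.y >= depth_xy
  3. df.z should be replaced with df.x, since it is the x dimension that has the overlap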
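The first two mistakes can be reproduced on a single axis with plain NumPy. The numbers below (an 8-wide chunk, depth 2, and the second chunk along the axis) are illustrative assumptions chosen to resemble the example, not output from the real pipeline.

```python
import numpy as np

depth = 2
chunk_len = 8 + 2 * depth  # an 8-wide chunk padded by `depth` on each side
offset = 8                 # global start of the second chunk along this axis
# Local positions of nonzero entries inside the padded chunk.
local = np.array([1, 2, 5, 9, 10])

# Corrected order: filter on local coordinates (inclusive lower bound),
# then shift the survivors to global coordinates.
keep = (local >= depth) & (local < chunk_len - depth)
global_fixed = local[keep] + offset - depth

# Original order: shift first, then compare global coordinates against
# bounds that only make sense for local coordinates, with a strict `>`.
shifted = local + offset - depth
keep_buggy = (shifted > depth) & (shifted < chunk_len - depth)
global_buggy = shifted[keep_buggy]

print(global_fixed)  # positions 8, 11, 15: all inside this chunk's own range 8..15
print(global_buggy)  # positions 7, 8: 7 belongs to the neighbouring chunk
```

The corrected version reports only positions this chunk owns, while the original order keeps a position from the overlap (which the neighbouring chunk would report as well) and drops valid ones.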

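To take this further, the code can be generalized to an arbitrary number of dimensions; the updated version in the question's edit (extract_features_generalized) is based on that generalization.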