Question

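I want to return a dask dataframe from an overlapping dask array computation, where each block's computation returns a pandas dataframe. The example below shows one way to do this, simplified for demonstration purposes. I found that a combination of da.overlap.overlap and to_delayed().ravel() gets the job done, provided I pass along the relevant block key and chunk information.

Edit: Thanks to @AnnaM, who caught the bugs in the original post and then made the code general! Based on her comments, I have included an updated version of the code. Also, in response to Anna's interest in memory usage, I verified that this does not seem to use more memory than one would naively expect.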

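import dask
import dask.array as da
import dask.dataframe as dd
import numpy as np
import pandas as pd

def extract_features_generalized(chunk, offsets, depth, columns):
    shape = np.asarray(chunk.shape)
    offsets = np.asarray(offsets)
    depth = np.asarray(depth)
    # Local coordinates of the non-zero entries, one row per feature
    coordinates = np.stack(np.nonzero(chunk)).T
    # Keep only features in the core of the chunk, not in the overlap
    keep = ((coordinates >= depth) & (coordinates < (shape - depth))).all(axis=1)
    # Convert local coordinates to coordinates in the full array
    data = coordinates + offsets - depth
    df = pd.DataFrame(data=data, columns=columns)
    return df[keep]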

def my_overlap_generalized(data, chunksize, depth, columns, boundary):
    data = data.rechunk(chunksize)
    data_overlapping_chunks = da.overlap.overlap(data, depth=depth, boundary=boundary)

    dfs = []
    for block in data_overlapping_chunks.to_delayed().ravel():
        offsets = np.array(block.key[1:]) * np.array(data.chunksize)
        df_block = dask.delayed(extract_features_generalized)(block, offsets=offsets, 
                                                              depth=depth, columns=columns)
        dfs.append(df_block)

    return dd.from_delayed(dfs)

data = np.zeros((2,4,8,16,16))  
data[0,0,4,2,2] = 1
data[0,1,4,6,2] = 1
data[1,2,4,8,2] = 1
data[0,3,4,2,2] = 1

arr = da.from_array(data)
df = my_overlap_generalized(arr, 
                            chunksize=(-1,-1,-1,8,8), 
                            depth=(0,0,0,2,2), 
                            columns=['r', 'c', 'z', 'y', 'x'],
                            boundary=tuple(['reflect']*5))
df.compute().reset_index()
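
The offset arithmetic above, `block.key[1:] * data.chunksize`, relies on every chunk before the last one along an axis having the same size, which holds after `rechunk` with a fixed chunk size. As a hedged sketch only (the helper name `block_offsets` is hypothetical and not part of dask), offsets can also be derived from a chunks tuple of the form a dask array exposes as `arr.chunks`, which would cover irregular chunk sizes as well:

```python
from itertools import accumulate, product

def block_offsets(chunks):
    """Map each block index to the global coordinate of its first element.

    `chunks` is a tuple of per-axis chunk-size tuples, e.g. ((2,), (8, 8)).
    """
    # Per axis, a chunk starts at the sum of the sizes of the chunks before it.
    starts = [[0] + list(accumulate(sizes))[:-1] for sizes in chunks]
    offsets = {}
    for index in product(*(range(len(s)) for s in starts)):
        offsets[index] = tuple(s[i] for s, i in zip(starts, index))
    return offsets

# Chunk layout of the example array after rechunk((-1, -1, -1, 8, 8)).
chunks = ((2,), (4,), (8,), (8, 8), (8, 8))
offsets = block_offsets(chunks)
print(offsets[(0, 0, 0, 1, 0)])  # (0, 0, 0, 8, 0)
```

For the regular chunks used in this post, both approaches give the same offsets; the lookup table only matters if the chunk sizes along an axis differ.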

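-Rest of the original post, including the original bugs-

My example only overlaps in xy, but it is easy to generalize. Is there anything here that is less than ideal or could be done better? Is anything likely to break because it relies on low-level information that could change (e.g., the block key)?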

def my_overlap(data, chunk_xy, depth_xy):
    data = data.rechunk((-1,-1,-1, chunk_xy, chunk_xy))
    data_overlapping_chunks = da.overlap.overlap(data, 
                                                 depth=(0,0,0,depth_xy,depth_xy), 
                                                 boundary={3: 'reflect', 4: 'reflect'})

    dfs = []
    for block in data_overlapping_chunks.to_delayed().ravel():
        offsets = np.array(block.key[1:]) * np.array(data.chunksize)
        df_block = dask.delayed(extract_features)(block, offsets=offsets, depth_xy=depth_xy)
        dfs.append(df_block)

    # All computation is delayed, so downstream computations need to know the format of the data.  If the meta
    # information is not specified, a single computation will be done (which could be expensive) at this point
    # to infer the metadata.
    # This empty dataframe has the index, column, and type information we expect in the computation.
    columns = ['r', 'c', 'z', 'y', 'x']

    # The dtypes are float64, except for a small number of columns
    df_meta = pd.DataFrame(columns=columns, dtype=np.float64)
    df_meta = df_meta.astype({'c': np.int64, 'r': np.int64})
    df_meta.index.name = 'feature'

    return dd.from_delayed(dfs, meta=df_meta)

def extract_features(chunk, offsets, depth_xy):
    r, c, z, y, x = np.nonzero(chunk) 
    df = pd.DataFrame({'r': r, 'c': c, 'z': z, 'y': y+offsets[3]-depth_xy, 
                       'x': x+offsets[4]-depth_xy})
    df = df[(df.y > depth_xy) & (df.y < (chunk.shape[3] - depth_xy)) &
            (df.z > depth_xy) & (df.z < (chunk.shape[4] - depth_xy))]
    return df

data = np.zeros((2,4,8,16,16))  # round, channel, z, y, x
data[0,0,4,2,2] = 1
data[0,1,4,6,2] = 1
data[1,2,4,8,2] = 1
data[0,3,4,2,2] = 1
arr = da.from_array(data)
df = my_overlap(arr, chunk_xy=8, depth_xy=2)
df.compute().reset_index()
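
The comments in `my_overlap` note that `meta` should describe the columns and dtypes that each block returns. As a hedged sketch (not the asker's code), one way to check a candidate `meta` is to build a frame from a tiny chunk the same way a block would and compare dtypes. It assumes the coordinates stay integer-typed, which is what `np.nonzero` returns; the names `meta` and `block_df` are illustrative.

```python
import numpy as np
import pandas as pd

columns = ['r', 'c', 'z', 'y', 'x']

# Empty frame with one integer column per axis, matching np.nonzero's index dtype.
meta = pd.DataFrame({name: pd.Series(dtype=np.intp) for name in columns})

# A frame built the way a single block would build it, from a tiny test chunk.
chunk = np.zeros((1, 1, 1, 4, 4))
chunk[0, 0, 0, 1, 2] = 1
block_df = pd.DataFrame(np.stack(np.nonzero(chunk)).T, columns=columns)

# The candidate meta is consistent if every column dtype agrees.
print((meta.dtypes == block_df.dtypes).all())  # True
```

If the comparison is False, the `meta` passed to `dd.from_delayed` would not match what the blocks actually produce.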

Answer 1

First of all, thanks for posting your code. I am working on a similar problem, and this was really helpful for me.

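While testing your code, I found a few mistakes in the extract_features function that prevent it from returning the correct indices.

Here is the corrected version:

def extract_features(chunk, offsets, depth_xy):
    r, c, z, y, x = np.nonzero(chunk)
    df = pd.DataFrame({'r': r, 'c': c, 'z': z, 'y': y, 'x': x})
    # Filter on local coordinates first, keeping only the core of the chunk
    df = df[(df.y >= depth_xy) & (df.y < (chunk.shape[3] - depth_xy)) &
            (df.x >= depth_xy) & (df.x < (chunk.shape[4] - depth_xy))]
    # Then convert the surviving rows to coordinates in the full array
    df['y'] = df['y'] + offsets[3] - depth_xy
    df['x'] = df['x'] + offsets[4] - depth_xy
    return df

The updated code now returns the indices that were set to 1:

   index  r  c  z  y  x
0      0  0  0  4  2  2
1      1  0  1  4  6  2
2      2  0  3  4  2  2
3      1  1  2  4  8  2

For comparison, this is the output of the original version:

   index  r  c  z  y  x
0      1  0  1  4  6  2
1      3  1  2  4  8  2
2      0  0  1  4  6  2
3      1  1  2  4  8  2

It returns the second and fourth of the expected rows, each one twice.

This happens because of three mistakes in the extract_features function:

1. The offset is added and the depth subtracted first, and the overlapping parts are filtered out afterwards: the order needs to be swapped.
2. df.y > depth_xy should be replaced with df.y >= depth_xy.
3. df.z should be replaced with df.x, since it is the x dimension that has an overlap.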
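
The first two fixes can be checked on a single axis with plain NumPy. This is a minimal sketch under assumed values (a core chunk size of 8 with depth 2, looking at the second block along the axis, so its offset is 8); the variable names are illustrative only.

```python
import numpy as np

depth = 2
core = 8                     # chunk size without overlap
padded = core + 2 * depth    # length of the overlapped chunk along this axis
offset = 8                   # global start of this block's core region

# Local coordinates inside the overlapped chunk: one in the leading overlap,
# the first core cell, the last core cell, and one in the trailing overlap.
local = np.array([0, 2, 9, 10])

# Corrected order: filter on local coordinates with an inclusive lower bound,
# then shift the survivors to global coordinates.
keep = (local >= depth) & (local < padded - depth)
global_coords = local[keep] + offset - depth
print(global_coords)  # [ 8 15]

# Original order: shift first, then compare global values against local bounds.
shifted = local + offset - depth
wrong = shifted[(shifted > depth) & (shifted < padded - depth)]
print(wrong)  # [6 8]
```

With the original order, the value 6 (which belongs to the previous block's core) is kept and 15 is dropped, which is the kind of duplication and loss seen in the original output.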

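To take this further, the code can be generalized to work for an arbitrary number of dimensions; the updated version in the question's edit does this.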
