Question

I have a 2D NumPy array, and it is huge. My computer's memory is not. A single copy of the array barely fits in memory; a second copy brings the machine to its knees.

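Before I slice the matrix into smaller, more manageable chunks, I need to add a few rows to it and remove some others. Luckily, I need to remove more rows than I add, so in theory this can all be done in place. I'm working on it, but I'm curious what advice you might have.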

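The plan so far:

1. Make a list of the rows to drop
2. Make a matrix of the rows to add
3. Replace rows to drop with rows to add (one by one; can't use fancy indexing here?)
4. Move any rows that still need to be removed to the end of the matrix
5. Call .resize() on the matrix to resize it in memory

Step 4 in particular is hard to implement efficiently.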
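To make step 4 concrete, here is a minimal sketch on a toy array. The helper name `move_dropped_rows_to_end` and its signature are hypothetical, and it assumes the indices in `rows_to_drop` are distinct. Rather than literally moving dropped rows, it overwrites each dropped row in the surviving part of the array with a kept row taken from the tail that will later be cut off, which has the same effect once the array is shrunk.

```python
import numpy as np


def move_dropped_rows_to_end(data, rows_to_drop):
    """Fill dropped rows in the head with kept rows from the tail.

    Hypothetical helper for illustration; assumes distinct indices.
    Returns the number of rows that are valid afterwards.
    """
    n_rows = data.shape[0]
    drop = np.sort(np.asarray(rows_to_drop))
    n_keep = n_rows - len(drop)

    # Holes: dropped rows inside the part that survives the shrink.
    holes = drop[drop < n_keep]
    # Fillers: kept rows inside the tail that will be cut off.
    tail = np.arange(n_keep, n_rows)
    fillers = np.setdiff1d(tail, drop, assume_unique=True)

    # There are always as many holes as fillers; copy one row at a time.
    for hole, filler in zip(holes, fillers):
        data[hole] = data[filler]
    return n_keep


# Toy data: row r holds the value r in both columns.
data = np.array([[r, r] for r in range(6)], dtype=float)
n_keep = move_dropped_rows_to_end(data, rows_to_drop=[1, 4])
print(data[:n_keep])  # rows 0, 5, 2, 3 of the original array
```

Only one row is copied per hole, so the extra memory needed is at most a single row. The order of the surviving rows is not preserved.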

Code:

import numpy as np

n_rows = 100
n_columns = 1000000
n_rows_to_drop = 20
n_rows_to_add = 10

# Init huge array
data = np.random.rand(n_rows, n_columns)

# Some rows we drop
to_drop = np.arange(n_rows)
np.random.shuffle(to_drop)
to_drop = to_drop[:n_rows_to_drop]


# Some rows we add
new_data = np.random.rand(n_rows_to_add, n_columns)

# Start replacing rows with new rows
for new_data_idx, to_drop_idx in enumerate(to_drop):
    if new_data_idx >= n_rows_to_add:
        break  # no more new data to add

    # Replace a row to drop with a new row
    data[to_drop_idx] = new_data[new_data_idx]

# These should still be dropped
to_drop = to_drop[n_rows_to_add:]
to_drop.sort()

# Make a list of row indices to keep, last rows first
to_keep = set(range(n_rows)) - set(to_drop)
to_keep = list(to_keep)
to_keep.sort()
to_keep = to_keep[::-1]

# Replace rows to drop with rows at the end of the matrix
for to_drop_idx, to_keep_idx in zip(to_drop, to_keep):
    if to_drop_idx > to_keep_idx:
        # All remaining rows to drop are at the end of the matrix
        break
    data[to_drop_idx] = data[to_keep_idx]

# Resize matrix in memory
data.resize(n_rows - n_rows_to_drop + n_rows_to_add, n_columns)

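This seems to work, but is there a way to make it more elegant or more efficient? And is there a way to check whether a copy of the huge array gets made at some point?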
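One possible way to check for hidden copies, sketched on a small array: compare the address of the data buffer before and after an operation, and use `np.shares_memory` to tell views from copies. The helper `buffer_address` is a hypothetical name introduced here for illustration.

```python
import numpy as np


def buffer_address(arr):
    """Address of the first byte of the array's data buffer."""
    return arr.__array_interface__["data"][0]


data = np.random.rand(100, 1000)
new_rows = np.random.rand(3, 1000)
address_before = buffer_address(data)

# Assigning through a fancy index writes into the existing buffer.
data[[5, 17, 42]] = new_rows
same_buffer_after_assign = buffer_address(data) == address_before

# Reading with a fancy index allocates a new array (a copy).
picked = data[[5, 17, 42]]
read_is_copy = not np.shares_memory(picked, data)

# Basic indexing (a single row, a slice) returns a view.
row_view = data[5]
slice_is_view = np.shares_memory(row_view, data)

print(same_buffer_after_assign, read_is_copy, slice_is_view)
```

This only shows whether a particular result shares memory with the big array. It does not reveal short-lived temporaries created inside an expression, so watching the process's peak memory while the script runs is a useful complement.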

Answer 1

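This seems to do the same as your code, but is slightly shorter. I'm fairly sure no copy of the big array is made here: assigning through a fancy index writes into the existing array in place. (Reading with a fancy index, by contrast, returns a copy.)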

import numpy as np

n_rows = 100
n_columns = 100000
n_rows_to_drop = 20
n_rows_to_add = 10

# Init huge array
data = np.random.rand(n_rows, n_columns)

# Some rows we drop (sampled without replacement, so all indices are distinct)
to_drop = np.random.choice(n_rows, size=n_rows_to_drop, replace=False)

# Some rows we add
new_data = np.random.rand(n_rows_to_add, n_columns)

# Start replacing rows with new rows
data[to_drop[:n_rows_to_add]] = new_data

# These should still be dropped
to_drop = np.sort(to_drop[n_rows_to_add:])

# Row indices to keep, last rows first
to_keep = np.setdiff1d(np.arange(n_rows), to_drop, assume_unique=True)[::-1]

# Replace rows to drop with rows at the end of the matrix
for to_drop_i, to_keep_i in zip(to_drop, to_keep):
    if to_drop_i > to_keep_i:
        # All remaining rows to drop are already in the tail that gets cut off
        break
    data[to_drop_i] = data[to_keep_i]

# Resize matrix in memory
data.resize(n_rows - n_rows_to_drop + n_rows_to_add, n_columns)
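One caveat about the final `data.resize(...)` call, shown as a small sketch: with the default `refcheck=True`, NumPy refuses to resize an array in place while another array (for example a leftover row view) still references it. The toy array and the variable names below are illustrative only.

```python
import numpy as np

data = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]])
last_row = data[-1]  # a view; it keeps a reference to `data`

try:
    data.resize(2, 2)  # refcheck=True is the default
    resize_blocked = False
except ValueError:
    # NumPy will not reallocate a buffer that a view still points into.
    resize_blocked = True

del last_row           # drop the view, then try again
data.resize(2, 2)      # now shrinks in place, keeping the first two rows
print(resize_blocked, data.shape)
```

So before shrinking the big array, it is worth making sure no row views or slices of it are still alive. Passing `refcheck=False` skips the check, but any remaining view would then point at memory that may no longer be valid.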

Efficiently combining in-place addition and removal of rows in a huge 2D NumPy array
