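I have a 2D NumPy array, and it is huge. I have some computer memory, and it is not huge. A single copy of the array fits snugly in memory. A second copy brings the computer to its knees.
Before I can slice the matrix into smaller, more manageable chunks, I need to add a few rows to it and remove some. Luckily, I need to remove more rows than I add, so in theory this can all be done in place. I am working on it, but I am curious what advice you can give me.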
The plan so far:
Use .resize() to resize the array in memory.
Step 4 in particular is hard to implement efficiently.
The code so far:
import numpy as np
n_rows = 100
n_columns = 1000000
n_rows_to_drop = 20
n_rows_to_add = 10
# Init huge array
data = np.random.rand(n_rows, n_columns)
# Some rows we drop
to_drop = np.arange(n_rows)
np.random.shuffle(to_drop)
to_drop = to_drop[:n_rows_to_drop]
# Some rows we add
new_data = np.random.rand(n_rows_to_add, n_columns)
# Start replacing rows with new rows
for new_data_idx, to_drop_idx in enumerate(to_drop):
    if new_data_idx >= n_rows_to_add:
        break  # no more new data to add
    # Replace a row to drop with a new row
    data[to_drop_idx] = new_data[new_data_idx]
# These should still be dropped
to_drop = to_drop[n_rows_to_add:]
to_drop.sort()
# Make a list of row indices to keep, last rows first
to_keep = set(range(n_rows)) - set(to_drop)
to_keep = list(to_keep)
to_keep.sort()
to_keep = to_keep[::-1]
# Replace rows to drop with rows at the end of the matrix
for to_drop_idx, to_keep_idx in zip(to_drop, to_keep):
    if to_drop_idx > to_keep_idx:
        # All remaining rows to drop are at the end of the matrix
        break
    data[to_drop_idx] = data[to_keep_idx]
# Resize matrix in memory
data.resize(n_rows - n_rows_to_drop + n_rows_to_add, n_columns)
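This seems to work, but is there a way to make it more elegant or more efficient? And is there a way to check whether a copy of the huge array gets made at some point?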
Answer 0 (score: 1)
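This seems to do the same thing as your code, but is slightly shorter. I am fairly sure no copy of the large array is made here: assigning through fancy indexing writes into the existing array in place, although reading with fancy indexing does return a copy.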
import numpy as np
n_rows = 100
n_columns = 100000
n_rows_to_drop = 20
n_rows_to_add = 10
# Init huge array
data = np.random.rand(n_rows, n_columns)
# Some rows we drop
# Sample without replacement so exactly n_rows_to_drop distinct rows are chosen
to_drop = np.random.choice(n_rows, n_rows_to_drop, replace=False)
# Some rows we add
new_data = np.random.rand(n_rows_to_add, n_columns)
# Start replacing rows with new rows
data[to_drop[:n_rows_to_add]] = new_data
# These should still be dropped (sorted, lowest index first)
to_drop = np.sort(to_drop[n_rows_to_add:])
# Row indices to keep, last rows first, one candidate per remaining row to drop
to_keep = np.setdiff1d(np.arange(n_rows), to_drop, assume_unique=True)[::-1][:len(to_drop)]
# Replace rows to drop with rows at the end of the matrix
for to_drop_i, to_keep_i in zip(to_drop, to_keep):
    if to_drop_i > to_keep_i:
        break  # all remaining rows to drop are already at the end
    data[to_drop_i] = data[to_keep_i]
# Resize matrix in memory
data.resize(n_rows - n_rows_to_drop + n_rows_to_add, n_columns)
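As for checking whether a copy gets made: one rough approach is to compare the address of the array's data buffer before and after a step, and to use np.shares_memory to see whether a result still points into the original. The sketch below is only an illustration on a small array. The helper name buffer_address is made up for this example, and passing refcheck=False is an assumption that no other views of the array exist.

```python
import numpy as np


def buffer_address(arr):
    """Return the address of the first byte of arr's data buffer."""
    return arr.__array_interface__["data"][0]


rng = np.random.default_rng(0)
n_rows, n_columns = 10, 4
data = rng.random((n_rows, n_columns))
address_before = buffer_address(data)

# Assigning through fancy indexing writes into the existing buffer,
# so the array still lives at the same address afterwards.
data[[1, 3]] = rng.random((2, n_columns))
same_after_assignment = buffer_address(data) == address_before

# Reading with fancy indexing is different: it allocates a new array.
subset = data[[1, 3]]
subset_is_copy = not np.shares_memory(subset, data)

# Shrinking with resize reallocates the buffer. The address usually stays
# the same when shrinking, but that is not guaranteed, so it is only printed.
# refcheck=False assumes no other views of data are alive.
data.resize((n_rows - 2, n_columns), refcheck=False)

print("same buffer after assignment:", same_after_assignment)
print("fancy-index read is a copy:", subset_is_copy)
print("same buffer after resize:", buffer_address(data) == address_before)
```

An unchanged address only shows that the array itself was not reallocated. It does not rule out a large temporary being created and freed in between, so watching the process's peak memory while the real job runs is still worthwhile.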