Efficiently combining in-place addition/removal of rows of a huge 2D NumPy array

Time: 2014-10-29 08:34:56

Tags: python numpy

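I have a 2D NumPy array, and it is huge. I have some computer memory, which is not huge. A single copy of the array fits tightly into memory. A second copy brings the computer to its knees.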

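Before I can slice the matrix into smaller, more manageable chunks, I need to add a few rows to it and remove some others. Luckily, I need to remove more rows than I add, so in theory this can all be done in place. I'm working on an implementation, but I'm curious what advice you have for me.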

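The plan so far:

  1. Make a list of the rows to remove
  2. Make a matrix of the rows to add
  3. Replace rows to be removed with rows to be added (one by one; fancy indexing can't be used here?)
  4. Move any rows that still need to be removed to the end of the matrix
  5. Call .resize() on the matrix to resize it in memory

Step 4 in particular is hard to implement efficiently.

Code so far: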

    import numpy as np
    
    n_rows = 100
    n_columns = 1000000
    n_rows_to_drop = 20
    n_rows_to_add = 10
    
    # Init huge array
    data = np.random.rand(n_rows, n_columns)
    
    # Some rows we drop
    to_drop = np.arange(n_rows)
    np.random.shuffle(to_drop)
    to_drop = to_drop[:n_rows_to_drop]
    
    
    # Some rows we add
    new_data = np.random.rand(n_rows_to_add, n_columns)
    
    # Start replacing rows with new rows
    for new_data_idx, to_drop_idx in enumerate(to_drop):
        if new_data_idx >= n_rows_to_add:
            break  # no more new data to add
    
        # Replace a row to drop with a new row
        data[to_drop_idx] = new_data[new_data_idx]
    
    # These should still be dropped
    to_drop = to_drop[n_rows_to_add:]
    to_drop.sort()
    
    # Make a list of row indices to keep, last rows first
    to_keep = set(range(n_rows)) - set(to_drop)
    to_keep = list(to_keep)
    to_keep.sort()
    to_keep = to_keep[::-1]
    
    # Replace rows to drop with rows at the end of the matrix
    for to_drop_idx, to_keep_idx in zip(to_drop, to_keep):
        if to_drop_idx > to_keep_idx:
            # All remaining rows to drop are at the end of the matrix
            break
        data[to_drop_idx] = data[to_keep_idx]
    
    # Resize matrix in memory
    data.resize(n_rows - n_rows_to_drop + n_rows_to_add, n_columns)
    

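This seems to work, but is there a way to make it more elegant or more efficient? And is there a way to check whether a copy of the huge array is made at some point?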
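On the question of detecting copies, one rough check is to ask NumPy whether two arrays share a buffer, and to compare the address of the data buffer before and after an operation. The snippet below is only a sketch on a tiny stand-in array (the shape and values are arbitrary assumptions), and it assumes a NumPy version that provides `np.shares_memory`. It shows whether two arrays share memory, but it will not reveal short-lived temporaries created inside a single expression.

```python
import numpy as np

# Small stand-in for the huge array (the values are arbitrary).
data = np.arange(12, dtype=float).reshape(4, 3)

# Basic slicing returns a view that shares the buffer of `data`.
row_slice = data[1:3]
shares_slice = np.shares_memory(data, row_slice)

# Reading through fancy (integer-array) indexing returns a new array,
# i.e. a copy of the selected rows only.
picked = data[[0, 2]]
shares_picked = np.shares_memory(data, picked)

# Assigning through fancy indexing writes into the existing buffer:
# the address of the data buffer does not change.
address_before = data.ctypes.data
data[[0, 2]] = -1.0
address_after = data.ctypes.data

print(shares_slice, shares_picked, address_before == address_after)
# prints: True False True
```

So `data[idx] = rows` stays in place, while `data[idx]` on the right-hand side of an assignment allocates a temporary as large as the selected rows.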

1 answer:

Answer 0 (score: 1)

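This seems to do the same thing as your code, only a bit shorter. I'm fairly sure no copy of the big array is made here: assigning through fancy indexing (`data[idx] = ...`) writes into the existing buffer. Note that reading with fancy indexing (`data[idx]` on the right-hand side) does return a copy, but only of the selected rows.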

    import numpy as np

    n_rows = 100
    n_columns = 100000
    n_rows_to_drop = 20
    n_rows_to_add = 10
    n_rows_final = n_rows - n_rows_to_drop + n_rows_to_add

    # Init huge array
    data = np.random.rand(n_rows, n_columns)

    # Some rows we drop (sampled without replacement, so the indices are unique)
    to_drop = np.random.choice(n_rows, n_rows_to_drop, replace=False)

    # Some rows we add
    new_data = np.random.rand(n_rows_to_add, n_columns)

    # Start replacing rows with new rows
    data[to_drop[:n_rows_to_add]] = new_data

    # These should still be dropped
    to_drop = np.sort(to_drop[n_rows_to_add:])

    # Rows in the tail that will be cut off but have to be kept
    to_keep = np.setdiff1d(np.arange(n_rows_final, n_rows), to_drop, assume_unique=True)

    # Move them into the rows to drop that lie before the cut
    for to_drop_i, to_keep_i in zip(to_drop[to_drop < n_rows_final], to_keep):
        data[to_drop_i] = data[to_keep_i]

    # Resize matrix in memory
    data.resize(n_rows_final, n_columns)
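The five steps of the plan can also be wrapped in a function that finds the holes and the surviving tail rows with a boolean mask. This is only a sketch: `replace_and_drop_rows` is a hypothetical helper name, it assumes `to_drop` contains unique indices and that there are no more new rows than dropped rows, and it does not preserve row order. Because the array is still referenced by the caller while the function runs, the sketch passes `refcheck=False` to `resize`, which is only safe if no other array or view refers to the same buffer.

```python
import numpy as np


def replace_and_drop_rows(data, to_drop, new_rows):
    """Overwrite dropped rows with new_rows, compact the rest, shrink in place.

    Hypothetical helper. Assumes unique indices in to_drop and
    len(new_rows) <= len(to_drop). Row order is not preserved.
    """
    to_drop = np.asarray(to_drop, dtype=np.intp)
    n_rows, n_new = data.shape[0], len(new_rows)
    if n_new > len(to_drop):
        raise ValueError("more rows to add than rows to drop")

    # Steps 1-3: reuse the first dropped slots for the new rows.
    data[to_drop[:n_new]] = new_rows

    # Step 4: mark the rows that still have to go.
    drop_mask = np.zeros(n_rows, dtype=bool)
    drop_mask[to_drop[n_new:]] = True
    n_final = n_rows - int(drop_mask.sum())

    # Holes in the part that survives, and rows to keep in the tail that is cut off.
    holes = np.flatnonzero(drop_mask[:n_final])
    survivors = n_final + np.flatnonzero(~drop_mask[n_final:])

    # Copy one row at a time so that no large temporary is created.
    for hole, survivor in zip(holes, survivors):
        data[hole] = data[survivor]

    # Step 5: shrink the buffer in place (requires that `data` owns its memory).
    data.resize((n_final, data.shape[1]), refcheck=False)
    return data


# .copy() makes sure the example array owns its buffer, which resize needs.
data = np.arange(30, dtype=float).reshape(10, 3).copy()
new_rows = -np.ones((2, 3))
replace_and_drop_rows(data, to_drop=[7, 1, 9, 4], new_rows=new_rows)
print(data.shape)  # (8, 3)
```

The number of holes always equals the number of surviving tail rows, so the loop moves at most as many rows as are dropped without replacement.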