Question

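I have a 2D array (a confusion matrix), for example of shape (3, 3). The numbers in the array are indices into a set of labels. I know that, with 5 row and column labels, this array should actually be (5, 5) rather than (3, 3). I can find the labels that were "hit", and from them the missing indices: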

import numpy as np

x = np.array([[3, 0, 3],
              [0, 2, 0],
              [2, 3, 3]])
labels = ["a", "b", "c", "d", "e"]
missing_idxs = np.setdiff1d(np.arange(len(labels)), x)  # array([1, 4])

I know that the rows and columns of the missing indices are all zeros, so the output I want is this:

y = np.array([[3, 0, 0, 3, 0],
              [0, 0, 0, 0, 0],  # <- Inserted row at index 1 all zeros
              [0, 0, 2, 0, 0],
              [2, 0, 3, 3, 0],
              [0, 0, 0, 0, 0]])  # <- Inserted row at index 4 all zeros
              #   ^        ^
              #   |        |
              # Inserted columns at index 1 and 4 all zeros

I can do this by calling np.insert in a loop over all the missing indices:

def insert_rows_columns_at_slow(arr, indices):
    result = arr.copy()
    for idx in indices:  # indices are expected in ascending order
        result = np.insert(result, idx, np.zeros(result.shape[1]), 0)  # zero row
        result = np.insert(result, idx, np.zeros(result.shape[0]), 1)  # zero column
    return result

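However, my actual arrays are much larger and there may be many more missing indices. Since np.insert reallocates on every call, this is not very efficient.

How can I get the same result in a more efficient, vectorized way? Bonus points if it works in more than 2 dimensions.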

Answer 1

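You can do this by preallocating the full result array and then filling the rows and columns with the old array. This works even in multiple dimensions, and the dimensions do not have to match in size: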

def insert_at(arr, output_size, indices):
    """
    Insert zeros at specific indices over whole dimensions, e.g. rows and/or columns and/or channels.
    You need to specify indices for each dimension, or leave a dimension untouched by specifying
    `...` for it. The following assertion should hold:

            `assert len(output_size) == len(indices) == len(arr.shape)`

    :param arr: The array to insert zeros into
    :param output_size: The size of the array after insertion is completed
    :param indices: The indices where zeros should be inserted, per dimension. For each dimension, you can
                specify: - an int
                         - a tuple of ints
                         - a sequence of ints that NumPy can convert to an array (such as `range`)
                         - Ellipsis (=...)
    :return: An array of shape `output_size` with the content of arr and zeros inserted at the given indices.
    """
    # assert len(output_size) == len(indices) == len(arr.shape)
    result = np.zeros(output_size)  # preallocate the output once (float dtype by default)
    # Per axis, the output positions that are NOT insertion points; these receive the values of arr
    existing_indices = [np.setdiff1d(np.arange(axis_size), axis_indices, assume_unique=True)
                        for axis_size, axis_indices in zip(output_size, indices)]
    # np.ix_ builds an open mesh, so this fills the cross product of the kept indices
    result[np.ix_(*existing_indices)] = arr
    return result
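
The key step in insert_at is the assignment through np.ix_. As a minimal sketch of what that step does for the 3x3 example from the question (the names keep, rows and cols are only for illustration, and the integer dtype is an assumption that differs from insert_at, which returns floats):

```python
import numpy as np

x = np.array([[3, 0, 3],
              [0, 2, 0],
              [2, 3, 3]])
keep = np.array([0, 2, 3])        # output indices that are not insertion points

# np.ix_ turns the 1D index arrays into an "open mesh":
# rows has shape (3, 1), cols has shape (1, 3).
rows, cols = np.ix_(keep, keep)

result = np.zeros((5, 5), dtype=x.dtype)
# Broadcasting rows against cols addresses the full 3x3 grid of
# (row, column) pairs, so x is written into that grid in one step.
result[rows, cols] = x

print(rows.shape, cols.shape)
print(result)
```

Everything outside the selected grid keeps its initial value of zero, which is what produces the inserted rows and columns.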

For your use case, you can use it like this:

def fill_by_label(arr, labels):
    # If this is your only use case, you can make it more efficient
    # by not computing the missing indices first, just to compute
    # the existing indices again
    missing_idxs = np.setdiff1d(np.arange(len(labels)), arr)  # use arr, not the global x
    return insert_at(arr, output_size=(len(labels), len(labels)),
                     indices=(missing_idxs, missing_idxs))

x = np.array([[3, 0, 3],
              [0, 2, 0],
              [2, 3, 3]])
labels = ["a", "b", "c", "d", "e"]
missing_idxs = np.setdiff1d(np.arange(len(labels)), x)
print(fill_by_label(x, labels))
>> [[3. 0. 0. 3. 0.]
    [0. 0. 0. 0. 0.]
    [0. 0. 2. 0. 0.]
    [2. 0. 3. 3. 0.]
    [0. 0. 0. 0. 0.]]

But this is very flexible. You can also use it for zero padding:

def zero_pad(arr):
    out_size = np.array(arr.shape) + 2
    indices = (0, out_size[0] - 1), (0, out_size[1] - 1)
    return insert_at(arr, output_size=out_size, indices=indices)

print(zero_pad(x))
>> [[0. 0. 0. 0. 0.]
    [0. 3. 0. 3. 0.]
    [0. 0. 2. 0. 0.]
    [0. 2. 3. 3. 0.]
    [0. 0. 0. 0. 0.]]
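
For plain zero padding, NumPy's built-in np.pad should give the same values as the zero_pad output above. This is only a quick comparison sketch, not part of the original answer; note that np.pad keeps the integer dtype of x, whereas insert_at returns floats.

```python
import numpy as np

x = np.array([[3, 0, 3],
              [0, 2, 0],
              [2, 3, 3]])

# pad_width=1 adds one row/column on every side of every axis;
# mode="constant" fills the new cells with 0 by default.
padded = np.pad(x, 1, mode="constant")
print(padded)
```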

It also works with non-square inputs and outputs:

x = np.ones((3, 4))
print(insert_at(x, (4, 5), (2, 3)))
>>[[1. 1. 1. 0. 1.]
   [1. 1. 1. 0. 1.]
   [0. 0. 0. 0. 0.]
   [1. 1. 1. 0. 1.]]

With a different number of insertions per dimension:

x = np.ones((3, 4))
print(insert_at(x, (4, 6), (1, (2, 4))))
>> [[1. 1. 0. 1. 0. 1.]
    [0. 0. 0. 0. 0. 0.]
    [1. 1. 0. 1. 0. 1.]
    [1. 1. 0. 1. 0. 1.]]

You can use range (or any other sequence that NumPy can convert to an array) instead of listing every index:

x = np.ones((3, 4))
print(insert_at(x, (4, 6), (1, range(2, 4))))
>>[[1. 1. 0. 0. 1. 1.]
   [0. 0. 0. 0. 0. 0.]
   [1. 1. 0. 0. 1. 1.]
   [1. 1. 0. 0. 1. 1.]]

It works for an arbitrary number of dimensions (as long as you specify indices for every dimension)¹:

x = np.ones((2, 2, 2))
print(insert_at(x, (3, 3, 3), (0, 0, 0)))
>>>[[[0. 0. 0.]
     [0. 0. 0.]
     [0. 0. 0.]]

    [[0. 0. 0.]
     [0. 1. 1.]
     [0. 1. 1.]]

    [[0. 0. 0.]
     [0. 1. 1.]
     [0. 1. 1.]]]

You can use Ellipsis (= ...) to indicate that you do not want to change a dimension¹,²:

x = np.ones((2, 2))
print(insert_at(x, (2, 4), (..., (0, 1))))
>>[[0. 0. 1. 1.]
   [0. 0. 1. 1.]]

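¹: You could detect this automatically from arr.shape and output_size and fill up with ... as needed, but I'll leave that to you if you need it. If you like, you could work from the output_size parameter instead, though that gets trickier when generators are passed in.

²: This differs somewhat from normal numpy ... semantics, because you need to specify ... for every dimension you want to keep, i.e. the following does NOT work: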

x = np.ones((2, 2, 2))
print(insert_at(x, (2, 2, 3), (..., 0)))
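
If numpy-like semantics are wanted, one possible approach is to expand a single ... into "insert nothing" entries before calling insert_at. The sketch below is an assumption-laden illustration: expand_ellipsis is a hypothetical helper that is not part of the answer, and insert_at is repeated in compact form only so that the block runs on its own.

```python
import numpy as np

def insert_at(arr, output_size, indices):
    # Same logic as insert_at above, repeated so this sketch is self-contained.
    result = np.zeros(output_size)
    existing = [np.setdiff1d(np.arange(n), idx, assume_unique=True)
                for n, idx in zip(output_size, indices)]
    result[np.ix_(*existing)] = arr
    return result

def expand_ellipsis(indices, ndim):
    """Hypothetical helper: replace one `...` by empty index tuples so that
    `indices` ends up with exactly `ndim` entries."""
    indices = tuple(indices)
    # Compare with `is`, because `==` misbehaves if an entry is a NumPy array.
    positions = [i for i, idx in enumerate(indices) if idx is Ellipsis]
    if not positions:
        return indices
    if len(positions) > 1:
        raise ValueError("at most one Ellipsis can be expanded")
    pos = positions[0]
    n_fill = ndim - (len(indices) - 1)   # dimensions covered by the Ellipsis
    # An empty tuple means "no insertions along this dimension".
    return indices[:pos] + ((),) * n_fill + indices[pos + 1:]

x = np.ones((2, 2, 2))
full = expand_ellipsis((..., 0), x.ndim)
out = insert_at(x, (2, 2, 3), full)
print(full)
print(out.shape)
```

With this expansion, the call that fails above goes through, and zeros are inserted only at index 0 of the last dimension.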

For timing, I inserted 10 rows and columns into a 90x90 array 100000 times; this is the result:

x = np.random.random(size=(90, 90))
indices = np.arange(10) * 10


def measure_time_fast():
    insert_at(x, (100, 100), (indices, indices))


def measure_time_slow():
    insert_rows_columns_at_slow(x, indices)


if __name__ == '__main__':
    import timeit
    for speed in ("fast", "slow"):
        times = timeit.repeat(f"measure_time_{speed}()", setup=f"from __main__ import measure_time_{speed}", repeat=10, number=10000)
        print(f"Min: {np.min(times) / 10000}, Max: {np.max(times) / 10000}, Mean: {np.mean(times) / 10000} seconds per call")

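For the fast version:

Min: 7.336409069976071e-05, Max: 7.7440657400075e-05, Mean: 7.520040466995852e-05 seconds per call

That is about 75 microseconds.

For the slow version:

Min: 0.00028272533010022016, Max: 0.0002923079213000165, Mean: 0.00028581595062998535 seconds per call

That is about 300 microseconds. The larger the arrays, the bigger the difference. For example, inserting 100 rows and columns into a 900x900 array gives these results (run only 1000 times):

Fast version:

Min: 0.00022916630539984907, Max: 0.0022916630539984908, Mean: 0.0022916630539984908 seconds per call

Slow version:

Min: 0.013766934227399906, Max: 0.13766934227399907, Mean: 0.13766934227399907 seconds per call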
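
The timing comparison is only meaningful if both functions return the same array. A minimal check along those lines is sketched below. It was not part of the original measurements; both functions are repeated in compact form so the block runs on its own, and it assumes the indices are sorted in ascending order, as in the benchmark.

```python
import numpy as np

def insert_rows_columns_at_slow(arr, indices):
    # Loop version from the question; it returns the result so it can be compared.
    result = arr.copy()
    for idx in indices:
        result = np.insert(result, idx, np.zeros(result.shape[1]), 0)
        result = np.insert(result, idx, np.zeros(result.shape[0]), 1)
    return result

def insert_at(arr, output_size, indices):
    # Preallocating version from this answer, without the docstring.
    result = np.zeros(output_size)
    existing = [np.setdiff1d(np.arange(n), idx, assume_unique=True)
                for n, idx in zip(output_size, indices)]
    result[np.ix_(*existing)] = arr
    return result

x = np.random.random(size=(90, 90))
indices = np.arange(10) * 10          # 0, 10, ..., 90 (ascending)

fast = insert_at(x, (100, 100), (indices, indices))
slow = insert_rows_columns_at_slow(x, indices)
print(fast.shape, slow.shape, np.array_equal(fast, slow))
```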

Answer 2

Another option:

Instead of using the missing indices, use the non-missing ones:

non_missing_idxs = np.intersect1d(np.arange(len(labels)), x)  # array([0, 2, 3])
y = np.zeros((5,5))
y[non_missing_idxs[:,None], non_missing_idxs] = x

Output:

array([[3., 0., 0., 3., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 2., 0., 0.],
       [2., 0., 3., 3., 0.],
       [0., 0., 0., 0., 0.]])
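
As a sketch of how this approach could be packaged for reuse, the function below finds the labels that occur in x with np.intersect1d and keeps the dtype of x, so an integer confusion matrix stays integer instead of becoming float. The name expand_confusion_matrix and the shape check are assumptions added for illustration, not part of the answer.

```python
import numpy as np

def expand_confusion_matrix(x, n_labels):
    """Hypothetical wrapper: place x into an (n_labels, n_labels) array of zeros."""
    # Label indices that actually occur in x (sorted, unique).
    present = np.intersect1d(np.arange(n_labels), x)
    if x.shape != (present.size, present.size):
        raise ValueError("x needs one row and one column per label present in it")
    y = np.zeros((n_labels, n_labels), dtype=x.dtype)
    # Column vector against row vector: broadcasting addresses the full grid.
    y[present[:, None], present] = x
    return y

x = np.array([[3, 0, 3],
              [0, 2, 0],
              [2, 3, 3]])
y = expand_confusion_matrix(x, 5)
print(y)
```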

Insert rows and columns of zeros simultaneously at specific indices, rather than at the end
