
Time: 2018-12-20 14:11:18

Tags: python numpy indexing vectorization

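I have a 2D array (a confusion matrix), for example of shape (3, 3). The numbers in the array are indices into a set of labels. I know that, for 5 row and column labels, this array should actually be (5, 5) rather than (3, 3). I can find which labels were "hit", and therefore which ones are missing: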

import numpy as np

x = np.array([[3, 0, 3],
              [0, 2, 0],
              [2, 3, 3]])
labels = ["a", "b", "c", "d", "e"]
missing_idxs = np.setdiff1d(np.arange(len(labels)), x)  # array([1, 4])

The desired result has all-zero rows and columns inserted at those indices:

y = np.array([[3, 0, 0, 3, 0],
              [0, 0, 0, 0, 0],  # <- Inserted row at index 1 all zeros
              [0, 0, 2, 0, 0],
              [2, 0, 3, 3, 0],
              [0, 0, 0, 0, 0]])  # <- Inserted row at index 4 all zeros
              #   ^        ^
              #   |        |
              # Inserted columns at index 1 and 4 all zeros


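A straightforward but slow way to build it:

def insert_rows_columns_at_slow(arr, indices):
    # Assumes `indices` is sorted in ascending order.
    result = arr.copy()
    for idx in indices:
        result = np.insert(result, idx, np.zeros(result.shape[1]), 0)  # zero row
        result = np.insert(result, idx, np.zeros(result.shape[0]), 1)  # zero column
    return result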
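
As a quick sanity check, the loop-based approach can be run against the example above. This is only a sketch: it assumes the function ends with `return result` and that `indices` is sorted in ascending order, because each `np.insert` call changes the size of the array that later calls see.

```python
import numpy as np


def insert_rows_columns_at_slow(arr, indices):
    # Assumption: indices are sorted ascending, so every idx is valid
    # for the array as it has grown so far.
    result = arr.copy()
    for idx in indices:
        result = np.insert(result, idx, np.zeros(result.shape[1]), 0)  # zero row
        result = np.insert(result, idx, np.zeros(result.shape[0]), 1)  # zero column
    return result


x = np.array([[3, 0, 3],
              [0, 2, 0],
              [2, 3, 3]])

y = insert_rows_columns_at_slow(x, [1, 4])
print(y.shape)  # (5, 5)

# Descending order fails: index 4 does not exist yet in a 3x3 array.
try:
    insert_rows_columns_at_slow(x, [4, 1])
except IndexError as err:
    print("IndexError:", err)
```

Sorting the indices first (for example with `np.sort`) would be one way to remove the ordering requirement.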



2 answers:

Answer 0 (score: 0)


def insert_at(arr, output_size, indices):
    """
    Insert zeros at specific indices over whole dimensions, e.g. rows and/or columns and/or channels.
    You need to specify indices for each dimension, or leave a dimension untouched by specifying
    `...` for it. The following assertion should hold:

            `assert len(output_size) == len(indices) == len(arr.shape)`

    :param arr: The array to insert zeros into
    :param output_size: The size of the array after insertion is completed
    :param indices: The indices where zeros should be inserted, per dimension. For each dimension, you can
                specify: - an int
                         - a tuple of ints
                         - a sequence of ints (such as `range`)
                         - Ellipsis (=...)
    :return: An array of shape `output_size` with the content of arr and zeros inserted at the given indices.
    """
    # assert len(output_size) == len(indices) == len(arr.shape)
    result = np.zeros(output_size)
    # For every axis, the positions that are NOT zero-filled receive the original data.
    existing_indices = [np.setdiff1d(np.arange(axis_size), axis_indices, assume_unique=True)
                        for axis_size, axis_indices in zip(output_size, indices)]
    result[np.ix_(*existing_indices)] = arr
    return result


def fill_by_label(arr, labels):
    # If this is your only use-case, you can make it more efficient
    # by not computing the missing indices first, just to compute
    # the existing indices again.
    missing_idxs = np.setdiff1d(np.arange(len(labels)), arr)
    return insert_at(arr, output_size=(len(labels), len(labels)),
                     indices=(missing_idxs, missing_idxs))

x = np.array([[3, 0, 3],
              [0, 2, 0],
              [2, 3, 3]])
labels = ["a", "b", "c", "d", "e"]
missing_idxs = np.setdiff1d(np.arange(len(labels)), x)
print(fill_by_label(x, labels))
>> [[3. 0. 0. 3. 0.]
    [0. 0. 0. 0. 0.]
    [0. 0. 2. 0. 0.]
    [2. 0. 3. 3. 0.]
    [0. 0. 0. 0. 0.]]
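
The comment in `fill_by_label` hints that the detour through the missing indices can be skipped. A minimal sketch of that idea follows; the name `fill_by_label_direct` is hypothetical, and it assumes that the number of distinct values in `arr` equals the side length of `arr`, as in the example.

```python
import numpy as np


def fill_by_label_direct(arr, n_labels):
    # Hypothetical helper: place `arr` at the label indices that occur in it.
    present = np.unique(arr)  # sorted indices that were "hit"
    if len(present) != arr.shape[0] or arr.shape[0] != arr.shape[1]:
        raise ValueError("expected a square array with one row/column per present label")
    result = np.zeros((n_labels, n_labels), dtype=arr.dtype)
    result[np.ix_(present, present)] = arr  # open mesh over rows and columns
    return result


x = np.array([[3, 0, 3],
              [0, 2, 0],
              [2, 3, 3]])
filled = fill_by_label_direct(x, 5)
print(filled)
```

Unlike `insert_at`, this sketch keeps the integer dtype of the input instead of returning floats.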


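def zero_pad(arr):
    out_size = np.array(arr.shape) + 2
    # Zeros go into the first and last position of both axes.
    indices = (0, out_size[0] - 1), (0, out_size[1] - 1)
    return insert_at(arr, output_size=out_size, indices=indices)

x = np.array([[3, 0, 3],
              [0, 2, 0],
              [2, 3, 3]])
print(zero_pad(x))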
>> [[0. 0. 0. 0. 0.]
    [0. 3. 0. 3. 0.]
    [0. 0. 2. 0. 0.]
    [0. 2. 3. 3. 0.]
    [0. 0. 0. 0. 0.]]


x = np.ones((3, 4))
print(insert_at(x, (4, 5), (2, 3)))
>>[[1. 1. 1. 0. 1.]
   [1. 1. 1. 0. 1.]
   [0. 0. 0. 0. 0.]
   [1. 1. 1. 0. 1.]]


x = np.ones((3, 4))
print(insert_at(x, (4, 6), (1, (2, 4))))
>> [[1. 1. 0. 1. 0. 1.]
    [0. 0. 0. 0. 0. 0.]
    [1. 1. 0. 1. 0. 1.]
    [1. 1. 0. 1. 0. 1.]]


x = np.ones((3, 4))
print(insert_at(x, (4, 6), (1, range(2, 4))))
>>[[1. 1. 0. 0. 1. 1.]
   [0. 0. 0. 0. 0. 0.]
   [1. 1. 0. 0. 1. 1.]
   [1. 1. 0. 0. 1. 1.]]

It works for an arbitrary number of dimensions (as long as you specify indices for every dimension) [1]

x = np.ones((2, 2, 2))
print(insert_at(x, (3, 3, 3), (0, 0, 0)))
>>>[[[0. 0. 0.]
     [0. 0. 0.]
     [0. 0. 0.]]

    [[0. 0. 0.]
     [0. 1. 1.]
     [0. 1. 1.]]

    [[0. 0. 0.]
     [0. 1. 1.]
     [0. 1. 1.]]]

You can use Ellipsis (= ...) to indicate that you do not want to change a dimension [1, 2]

x = np.ones((2, 2))
print(insert_at(x, (2, 4), (..., (0, 1))))
>>[[0. 0. 1. 1.]
   [0. 0. 1. 1.]]

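[1]: You could detect this automatically from arr.shape and output_size and fill in ... as needed, but I'll leave that to you if you need it. If you wanted to, you could probably drop the output_size parameter instead, but that gets trickier when generators are passed in.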

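[2]: This differs somewhat from normal numpy ... semantics, because you need to specify ... for every dimension you want to keep, i.e. the following does NOT work: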

x = np.ones((2, 2, 2))
print(insert_at(x, (2, 2, 3), (..., 0)))
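
Footnote 1 mentions that the `...` entries could be filled in automatically by comparing `arr.shape` with `output_size`. One possible sketch is shown here. The helper `complete_indices` is hypothetical, and the `insert_at` below is a compact variant that handles Ellipsis explicitly rather than the exact function from the answer.

```python
import numpy as np


def complete_indices(arr_shape, output_size, indices):
    # Hypothetical helper: put `...` on every axis whose size is unchanged.
    remaining = iter(indices)
    completed = []
    for old, new in zip(arr_shape, output_size):
        if old == new:
            completed.append(Ellipsis)         # axis left untouched
        else:
            completed.append(next(remaining))  # growing axis consumes one entry
    return tuple(completed)


def insert_at(arr, output_size, indices):
    # Compact variant with explicit Ellipsis handling (an assumption of this sketch).
    result = np.zeros(output_size)
    existing = [
        np.arange(size) if idx is Ellipsis else np.setdiff1d(np.arange(size), idx)
        for size, idx in zip(output_size, indices)
    ]
    result[np.ix_(*existing)] = arr
    return result


x = np.ones((2, 2, 2))
full = complete_indices(x.shape, (2, 2, 3), (0,))
print(full)                      # (Ellipsis, Ellipsis, 0)
out = insert_at(x, (2, 2, 3), full)
print(out.shape)                 # (2, 2, 3)
```

The helper only looks at sizes, so it cannot tell apart an axis that is unchanged from one where the caller forgot to supply indices; an explicit length check would be a sensible addition.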


x = np.random.random(size=(90, 90))
indices = np.arange(10) * 10

def measure_time_fast():
    insert_at(x, (100, 100), (indices, indices))

def measure_time_slow():
    insert_rows_columns_at_slow(x, indices)

if __name__ == '__main__':
    import timeit
    for speed in ("fast", "slow"):
        times = timeit.repeat(f"measure_time_{speed}()", setup=f"from __main__ import measure_time_{speed}", repeat=10, number=10000)
        print(f"Min: {np.min(times) / 10000}, Max: {np.max(times) / 10000}, Mean: {np.mean(times) / 10000} seconds per call")



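Fast version:

Min: 7.336409069976071e-05, Max: 7.7440657400075e-05, Mean: 7.520040466995852e-05 seconds per call

Slow version:

Min: 0.00028272533010022016, Max: 0.0002923079213000165, Mean: 0.00028581595062998535 seconds per call

That is about 300 microseconds. The larger the array, the bigger the difference. For example, inserting 100 rows and columns into a 900x900 array gives the following results (only 1000 runs):

Fast version:

Min: 0.00022916630539984907, Max: 0.0022916630539984908, Mean: 0.0022916630539984908 seconds per call

Slow version:

Min: 0.013766934227399906, Max: 0.13766934227399907, Mean: 0.13766934227399907 seconds per call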
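
The timings are only comparable if both functions produce the same array. A minimal check on the benchmark input might look like this; both definitions are repeated in shortened form so the snippet runs on its own, and the fixed random seed is an arbitrary choice for repeatability.

```python
import numpy as np


def insert_at(arr, output_size, indices):
    result = np.zeros(output_size)
    existing = [np.setdiff1d(np.arange(size), idx, assume_unique=True)
                for size, idx in zip(output_size, indices)]
    result[np.ix_(*existing)] = arr
    return result


def insert_rows_columns_at_slow(arr, indices):
    result = arr.copy()
    for idx in indices:
        result = np.insert(result, idx, np.zeros(result.shape[1]), 0)
        result = np.insert(result, idx, np.zeros(result.shape[0]), 1)
    return result


rng = np.random.default_rng(0)      # fixed seed, chosen only for repeatability
x = rng.random((90, 90))
indices = np.arange(10) * 10        # sorted ascending, as the slow loop requires

fast = insert_at(x, (100, 100), (indices, indices))
slow = insert_rows_columns_at_slow(x, indices)
print(np.array_equal(fast, slow))   # True
```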

Answer 1 (score: 0)



non_missing_idxs = np.intersect1d(np.arange(len(labels)), x)  # array([0, 2, 3])
y = np.zeros((5,5))
y[non_missing_idxs[:,None], non_missing_idxs] = x


array([[3., 0., 0., 3., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 2., 0., 0.],
       [2., 0., 3., 3., 0.],
       [0., 0., 0., 0., 0.]])
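
The indexing expression `y[idx[:, None], idx]` used here and the `np.ix_` call in the other answer select the same cells. A short sketch that checks this on the example data, with the index array written out by hand:

```python
import numpy as np

x = np.array([[3, 0, 3],
              [0, 2, 0],
              [2, 3, 3]])
idx = np.array([0, 2, 3])     # label indices present in x

a = np.zeros((5, 5))
a[idx[:, None], idx] = x      # column vector broadcast against a row vector

b = np.zeros((5, 5))
b[np.ix_(idx, idx)] = x       # np.ix_ builds the same open mesh

print(np.array_equal(a, b))   # True
```

`np.ix_` generalizes more easily to more than two axes, while the broadcasting form avoids one function call in the 2D case.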