Suppose I have an array `b` of shape `(3, 10, 3)` and another array `v = [8, 9, 4]` of shape `(3,)`, see below. For each of the 3 arrays of shape `(10, 3)` in `b`, I need to sum the number of rows determined by `v`, i.e. for `i = 0, 1, 2` I need to get `np.sum(b[i, 0:v[i]], axis=0)`. My solution (shown below) uses a for loop, which I guess is inefficient. I would like to know whether there is an efficient (vectorized) way to do what I described above.

Note: my actual arrays have larger dimensions; these arrays are only for illustration.
v = np.array([8,9,4])
b = np.array([[[0., 1., 0.],
[0., 0., 1.],
[0., 0., 1.],
[0., 0., 1.],
[1., 0., 0.],
[1., 0., 0.],
[0., 0., 1.],
[1., 0., 0.],
[0., 1., 0.],
[1., 0., 0.]],
[[0., 0., 1.],
[0., 1., 0.],
[1., 0., 0.],
[0., 0., 1.],
[1., 0., 0.],
[1., 0., 0.],
[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.],
[0., 1., 0.]],
[[1., 0., 0.],
[1., 0., 0.],
[1., 0., 0.],
[0., 0., 1.],
[0., 1., 0.],
[0., 1., 0.],
[1., 0., 0.],
[1., 0., 0.],
[0., 0., 1.],
[1., 0., 0.]]])
n = v.shape[0]
vv = np.zeros([n, n])
for i in range(n):
    vv[i] = np.sum(b[i, 0:v[i]], axis=0)
Output:
vv
array([[3., 1., 4.],
[4., 2., 3.],
[3., 0., 1.]])
Edit: Below is an actual example of the arrays `v` and `b`:

v = np.random.randint(0, 300, size=(32, 98, 3))
b = np.zeros([98, 3, 300, 3])
for i in range(3):
    for j in range(98):
        b[j, i] = np.random.multinomial(1, [1./3, 1./3, 1./3], 300)

v.shape
Out[292]: (32, 98, 3)

b.shape
Out[293]: (98, 3, 300, 3)

I need to do the same as before, so that the final result is an array of shape `(32, 98, 3, 3)`. Note that I have to perform this operation at every iteration, which is why I am looking for an efficient implementation.
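For reference, the required operation written out as a plain loop for these real shapes (a slow sketch, only to pin down the semantics):

out = np.zeros((32, 98, 3, 3))
for k in range(32):      # leading dimension of v
    for j in range(98):  # dimensions shared by v and b
        for i in range(3):
            out[k, j, i] = np.sum(b[j, i, 0:v[k, j, i]], axis=0)
out.shape  # (32, 98, 3, 3)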
Answer 0 (score: 1)
The following function allows reducing a given axis over varying slices indicated by start and stop index arrays. Under the hood it uses np.ufunc.reduceat together with appropriately reshaped input arrays and indices. It avoids unnecessary computations, but it allocates an intermediate array that is twice the size of the final output array (the computations for the discarded values are no-ops, though).
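As a quick reminder of the np.ufunc.reduceat semantics this relies on, a minimal sketch:

import numpy as np

a = np.arange(8)                      # [0 1 2 3 4 5 6 7]
print(np.add.reduceat(a, [0, 4, 6]))  # [ 6  9 13] -> sums over a[0:4], a[4:6], a[6:]
# If indices[k] >= indices[k+1], the output is simply a[indices[k]]; this is the quirk
# the function below turns into cheap no-ops for the uninteresting odd-to-even segments.

The implementation: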
send.mail(from = "jsmith@amazon.com",
to = "jsmith@gmail.com",
subject = "subject",
body = "msg",
authenticate = TRUE,
smtp = list(host.name = "smtp.office365.com", port = 587,
user.name = "jsmith@amazon.com", passwd = "pw!", tls = TRUE))
org.apache.commons.mail.EmailException: Sending the email to the following
server failed : smtp.office365.com:587
at org.apache.commons.mail.Email.sendMimeMessage(Email.java:1410)
at org.apache.commons.mail.Email.send(Email.java:1437)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at RJavaTools.invokeMethod(RJavaTools.java:386)
Caused by: com.sun.mail.util.MailConnectException: Couldn't connect to
host, port: smtp.office365.com, 587; timeout 60000;
nested exception is:
import numpy as np

def sliced_reduce(a, i, j, ufunc, axis=None):
    """Reduce an array along a given axis for varying slices `a[..., i:j, ...]` where `i` and `j` are arrays themselves.

    Parameters
    ----------
    a : array
        The array to be reduced.
    i : array
        Start indices for the reduced axis. Must have the same shape as `j`.
    j : array
        Stop indices for the reduced axis. Must have the same shape as `i`.
    ufunc : function
        The function used for reducing the indicated axis.
    axis : int, optional
        Axis to be reduced. Defaults to `len(i.shape)`.

    Returns
    -------
    array
        Shape `i.shape + a.shape[axis+1:]`.

    Notes
    -----
    The shapes of `a` and `i`, `j` must match up to the reduced axis.
    That means `a.shape[:axis] == i.shape[len(i.shape) - axis:]`.
    `i` and `j` can have additional leading dimensions and `a` can have additional trailing dimensions.
    """
    if axis is None:
        axis = len(i.shape)
    indices = np.tile(
        np.repeat(
            np.arange(np.prod(a.shape[:axis])) * a.shape[axis],
            2  # Repeat two times to have start and stop indices next to each other.
        ),
        np.prod(i.shape[:len(i.shape) - axis])  # Perform summation for each element of additional axes.
    )
    # Add `a.shape[axis]` to account for negative indices.
    indices[::2] += (a.shape[axis] + i.ravel()) % a.shape[axis]
    indices[1::2] += (a.shape[axis] + j.ravel()) % a.shape[axis]
    # Now indices are sorted in ascending order but this will lead to unnecessary computation when reducing
    # from odd to even indices (since we're only interested in even to odd indices).
    # Hence we reverse the order of index pairs (need to reverse the result as well then).
    indices = indices.reshape(-1, 2)[::-1].ravel()
    result = ufunc.reduceat(a.reshape(-1, *a.shape[axis+1:]), indices)[::2]  # Select only even to odd.
    # In case start and stop index are equal (i.e. empty slice) `reduceat` will select the element
    # corresponding to the start index. Need to supply the correct default value in this case.
    result[indices[::2] == indices[1::2]] = ufunc.reduce([])
    return result[::-1].reshape(*(i.shape + a.shape[axis+1:]))  # Reverse order and reshape.
For the examples in the OP it can be used as follows:

# 1. example:
b = np.random.randint(0, 1000, size=(3, 10, 3))
v = np.random.randint(-9, 10, size=3) # Indexing into `b.shape[1]`.
result = sliced_reduce(b, np.zeros_like(v), v, np.add)
# 2. example:
b = np.random.randint(0, 1000, size=(98, 3, 300, 3))
v = np.random.randint(-299, 300, size=(32, 98, 3)) # Indexing into `b.shape[2]`; one additional leading dimension for `v`.
result = sliced_reduce(b, np.zeros_like(v), v, np.add, axis=2)
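A quick sanity check against the question's loop, re-running the first example (a sketch):

# Reference result via the plain loop (negative stops follow Python slice semantics):
b = np.random.randint(0, 1000, size=(3, 10, 3))
v = np.random.randint(-9, 10, size=3)
expected = np.stack([np.sum(b[i, 0:v[i]], axis=0) for i in range(3)])
assert np.array_equal(sliced_reduce(b, np.zeros_like(v), v, np.add), expected)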
Remark: Reversing the order of the index pairs, and hence shortcutting every second computation with a no-op, does not seem to be a good idea after all (probably because the flattened array is then not traversed in memory-layout order). Removing this part and using flat indices in ascending order gives a performance improvement of about 30% (the same holds for the perfplots below, although it is not included there).

Answer 1 (score: 1)
The following function allows summing a given axis over varying slices indicated by start and stop index arrays. Under the hood it uses np.einsum together with an appropriately computed coefficient array, which indicates what elements of the input array should participate in the sum (using coefficients 1 and 0). Relying on einsum makes the implementation compatible with other packages such as PyTorch or TensorFlow (with minor changes). Since every addition is accompanied by one extra multiplication with the coefficient array, it doubles the number of required computations.
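To see the coefficient-array idea in isolation, a minimal sketch on the question's small example (a boolean mask serves as the 0/1 coefficients):

import numpy as np

b = np.random.randint(0, 1000, size=(3, 10, 3))
v = np.array([8, 9, 4])
mask = np.arange(b.shape[1]) < v[:, None]  # shape (3, 10); True where the row participates
print(np.einsum('ij,ijk->ik', mask, b))    # equals np.sum(b[i, 0:v[i]], axis=0) for each i

The generalized implementation follows: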
from string import ascii_lowercase as symbols

import numpy as np

def sliced_sum(a, i, j, axis=None):
    """Sum an array along a given axis for varying slices `a[..., i:j, ...]` where `i` and `j` are arrays themselves.

    Parameters
    ----------
    a : array
        The array to be summed over.
    i : array
        The start indices for the summation axis. Must have the same shape as `j`.
    j : array
        The stop indices for the summation axis. Must have the same shape as `i`.
    axis : int, optional
        Axis to be summed over. Defaults to `len(i.shape)`.

    Returns
    -------
    array
        Shape `i.shape + a.shape[axis+1:]`.

    Notes
    -----
    The shapes of `a` and `i`, `j` must match up to the summation axis.
    That means `a.shape[:axis] == i.shape[len(i.shape) - axis:]`.
    `i` and `j` can have additional leading dimensions and `a` can have additional trailing dimensions.
    """
    if axis is None:
        axis = len(i.shape)
    # Compute number of leading, common and trailing dimensions.
    l = len(i.shape) - axis      # Number of leading dimensions.
    m = len(i.shape) - l         # Number of common dimensions.
    n = len(a.shape) - axis - 1  # Number of trailing dimensions.
    # Select the corresponding symbols for `np.einsum`.
    leading = symbols[:l]
    common = symbols[l:l+m]
    summation = symbols[l+m]
    trailing = symbols[l+m+1:l+m+1+n]
    # Convert negative indices.
    i = (a.shape[axis] + i) % a.shape[axis]
    j = (a.shape[axis] + j) % a.shape[axis]
    # Compute the "active" elements, i.e. the ones that should participate in the summation.
    # "active" elements have a coefficient of 1 (True), others are 0 (False).
    indices, i, j = np.broadcast_arrays(np.arange(a.shape[axis]),
                                        np.expand_dims(i, -1), np.expand_dims(j, -1))
    active_elements = (i <= indices) & (indices < j)
    return np.einsum(f'{leading + common + summation},{common + summation + trailing}->{leading + common + trailing}',
                     active_elements, a)
For the examples in the OP it can be used as follows:
# 1. example:
b = np.random.randint(0, 1000, size=(3, 10, 3))
v = np.random.randint(-9, 10, size=3) # Indexing into `b.shape[1]`.
result = sliced_sum(b, np.zeros_like(v), v)
# 2. example:
b = np.random.randint(0, 1000, size=(98, 3, 300, 3))
v = np.random.randint(-299, 300, size=(32, 98, 3)) # Indexing into `b.shape[2]`; one additional leading dimension for `v`.
result = sliced_sum(b, np.zeros_like(v), v, axis=2)
Answer 2 (score: 1)
Another option is to use Numba to speed up the loop. This avoids unnecessary computations and memory allocations and is fully compatible with all numpy functions (i.e. prod etc. work in a similar fashion).
import numba
import numpy as np

def sliced_sum_numba(a, i, j, axis=None):
    """Sum an array along a given axis for varying slices `a[..., i:j, ...]` where `i` and `j` are arrays themselves.

    Parameters
    ----------
    a : array
        The array to be summed over.
    i : array
        The start indices for the summation axis. Must have the same shape as `j`.
    j : array
        The stop indices for the summation axis. Must have the same shape as `i`.
    axis : int, optional
        Axis to be summed over. Defaults to `len(i.shape)`.

    Returns
    -------
    array
        Shape `i.shape + a.shape[axis+1:]`.

    Notes
    -----
    The shapes of `a` and `i`, `j` must match up to the summation axis.
    That means `a.shape[:axis] == i.shape[len(i.shape) - axis:]`.
    `i` and `j` can have additional leading dimensions and `a` can have additional trailing dimensions.
    """
    if axis is None:
        axis = len(i.shape)
    # Convert negative indices.
    i = (a.shape[axis] + i) % a.shape[axis]
    j = (a.shape[axis] + j) % a.shape[axis]
    # Operate on a flattened version of the array (dimensions up to `axis` are flattened).
    m = np.prod(i.shape[:len(i.shape) - axis], dtype=int)  # Elements in leading dimensions.
    n = np.prod(i.shape[len(i.shape) - axis:], dtype=int)  # Elements in common dimensions.
    a_flat = a.reshape(-1, *a.shape[axis:])
    i_flat = i.ravel()
    j_flat = j.ravel()
    result = np.empty((m*n,) + a.shape[axis+1:], dtype=a.dtype)
    numba_sum(a_flat, i_flat, j_flat, m, n, result)
    return result.reshape(*(i.shape + a.shape[axis+1:]))

@numba.jit(parallel=True, nopython=True)
def numba_sum(a, i, j, m, n, out):
    for index in numba.prange(m*n):
        out[index] = np.sum(a[index % n, i[index]:j[index]], axis=0)
For the examples in the OP it can be used as follows:
# 1. example:
b = np.random.randint(0, 1000, size=(3, 10, 3))
v = np.random.randint(-9, 10, size=3) # Indexing into `b.shape[1]`.
result = sliced_sum_numba(b, np.zeros_like(v), v)
# 2. example:
b = np.random.randint(0, 1000, size=(98, 3, 300, 3))
v = np.random.randint(-299, 300, size=(32, 98, 3)) # Indexing into `b.shape[2]`; one additional leading dimension for `v`.
result = sliced_sum_numba(b, np.zeros_like(v), v, axis=2)
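Note that the first call to a numba.jit-decorated function also pays the JIT compilation cost for the given argument types; when timing it, trigger compilation first (a sketch, with b and v from the second example above):

_ = sliced_sum_numba(b, np.zeros_like(v), v, axis=2)  # first call compiles
# Subsequent calls run the cached machine code and reflect the steady-state speed.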
Answer 3 (score: 1)
Here is a performance comparison of the different approaches proposed in the answers:

- sliced_reduce
- sliced_sum
- sliced_sum_numba
- reduce_cumulative (original idea here)
- baseline - the "classic" Python for loop (see below).

sliced_reduce converts the order of the index pairs from ascending to descending in order to turn the computations for the superfluous elements into no-ops; however, this way the array is not traversed in memory-layout order, which seems to slow the method down by about 30%.

reduce_cumulative performs a number of unnecessary addition operations, depending on the distribution of the start and stop indices. For the OP's example, where all start indices are zero and the stop indices are uniformly distributed, this is on average twice the strictly necessary number of additions. For other distributions (e.g. non-zero start indices) that fraction can change considerably and degrade the performance relative to the other methods, so check for your own case.

Using the example dimensions from the OP:
In [15]: np.random.seed(0)
In [16]: b = np.random.randint(0, 1000, size=(98, 3, 300, 3))
In [17]: v = np.random.randint(-299, 300, size=(32, 98, 3))
In [18]: %timeit sliced_reduce(b, np.zeros_like(v), v, np.add, axis=2)
11.3 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [19]: %timeit sliced_sum(b, np.zeros_like(v), v, axis=2)
54.9 ms ± 153 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [20]: %timeit sliced_sum_numba(b, np.zeros_like(v), v, 2)
16.3 ms ± 609 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [21]: %timeit reduce_cumulative(b, np.zeros_like(v), v, np.add, axis=2)
2.05 ms ± 30.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [22]: %timeit baseline(b, np.zeros_like(v), v, axis=2)
79 ms ± 625 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Baseline implementation:
def baseline(a, i, j, axis=None):
    if axis is None:
        axis = len(i.shape)
    i = (a.shape[axis] + i) % a.shape[axis]
    j = (a.shape[axis] + j) % a.shape[axis]
    m = len(i.shape) - axis
    result = np.empty(i.shape + a.shape[axis+1:], dtype=a.dtype)
    for k in np.ndindex(i.shape):
        result[k] = np.sum(a[k[m:] + (slice(i[k], j[k]),)], axis=0)
    return result
Besides the timings for the OP's specific example, it is also instructive to check how the algorithms scale with the size of the data and index arrays. Here we can separate the shapes into three different parts:

- the leading dimensions of the index arrays: (32,) in the OP's example,
- the common dimensions of the index and data arrays: (98, 3),
- the size of the reduced dimension itself: 300.

Hence we can create performance plots for three different cases: varying the leading dimensions, varying the common dimensions, and varying the size of the reduced axis. The boundaries are chosen from 1 to N, where N is the largest power of 2 such that no array involved (input, indices, output) contains more than 5,000,000 elements (intermediate arrays may be even larger, e.g. for sliced_reduce).

See below for the code.
from string import ascii_lowercase as symbols

import numba
import numpy as np
import perfplot

np.random.seed(0)

def sliced_reduce(a, i, j, ufunc=np.add, axis=2):
    indices = np.tile(
        np.repeat(
            np.arange(np.prod(a.shape[:axis])) * a.shape[axis],
            2
        ),
        np.prod(i.shape[:len(i.shape) - axis])
    )
    indices[::2] += (a.shape[axis] + i.ravel()) % a.shape[axis]
    indices[1::2] += (a.shape[axis] + j.ravel()) % a.shape[axis]
    indices = indices.reshape(-1, 2)[::-1].ravel()  # This seems to be counter-effective, please check for your own case.
    result = ufunc.reduceat(a.reshape(-1, *a.shape[axis+1:]), indices)[::2]  # Select only even to odd.
    result[indices[::2] == indices[1::2]] = ufunc.reduce([])
    return result[::-1].reshape(*(i.shape + a.shape[axis+1:]))

def sliced_sum(a, i, j, axis=2):
    l = len(i.shape) - axis
    m = len(i.shape) - l
    n = len(a.shape) - axis - 1
    leading = symbols[:l]
    common = symbols[l:l+m]
    summation = symbols[l+m]
    trailing = symbols[l+m+1:l+m+1+n]
    i = (a.shape[axis] + i) % a.shape[axis]
    j = (a.shape[axis] + j) % a.shape[axis]
    indices, i, j = np.broadcast_arrays(np.arange(a.shape[axis]),
                                        np.expand_dims(i, -1), np.expand_dims(j, -1))
    active_elements = (i <= indices) & (indices < j)
    return np.einsum(f'{leading + common + summation},{common + summation + trailing}->{leading + common + trailing}',
                     active_elements, a)

def sliced_sum_numba(a, i, j, axis=2):
    i = (a.shape[axis] + i) % a.shape[axis]
    j = (a.shape[axis] + j) % a.shape[axis]
    m = np.prod(i.shape[:len(i.shape) - axis], dtype=int)
    n = np.prod(i.shape[len(i.shape) - axis:], dtype=int)
    a_flat = a.reshape(-1, *a.shape[axis:])
    i_flat = i.ravel()
    j_flat = j.ravel()
    result = np.empty((m*n,) + a.shape[axis+1:], dtype=a.dtype)
    numba_sum(a_flat, i_flat, j_flat, m, n, result)
    return result.reshape(*(i.shape + a.shape[axis+1:]))

@numba.jit(parallel=True, nopython=True)
def numba_sum(a, i, j, m, n, out):
    for index in numba.prange(m*n):
        out[index] = np.sum(a[index % n, i[index]:j[index]], axis=0)

def reduce_cumulative(a, i, j, ufunc=np.add, axis=2):
    i = (a.shape[axis] + i) % a.shape[axis]
    j = (a.shape[axis] + j) % a.shape[axis]
    a = np.insert(a, 0, 0, axis)
    c = ufunc.accumulate(a, axis=axis)
    pre = np.ix_(*(range(x) for x in i.shape))
    l = len(i.shape) - axis
    return c[pre[l:] + (j,)] - c[pre[l:] + (i,)]

def baseline(a, i, j, axis=2):
    i = (a.shape[axis] + i) % a.shape[axis]
    j = (a.shape[axis] + j) % a.shape[axis]
    m = len(i.shape) - axis
    result = np.empty(i.shape + a.shape[axis+1:], dtype=a.dtype)
    for k in np.ndindex(i.shape):
        result[k] = np.sum(a[k[m:] + (slice(i[k], j[k]),)], axis=0)
    return result

a = np.random.randint(0, 1000, size=(98, 3, 300, 3))
j = np.random.randint(-299, 300, size=(32, 98, 3))
i = np.zeros_like(j)
check = [f(a, i, j) for f in [sliced_reduce, sliced_sum, sliced_sum_numba, reduce_cumulative, baseline]]
assert all(np.array_equal(check[0], x) for x in check[1:])

perfplot.show(
    # Leading dimensions:
    # setup=lambda n: (np.random.randint(0, 1000, size=(98, 3, 300, 3)),
    #                  np.zeros((n, 98, 3), dtype=int),
    #                  np.random.randint(-299, 300, size=(n, 98, 3))),
    # Common dimensions:
    # setup=lambda n: (np.random.randint(0, 1000, size=(n, 3, 300, 3)),
    #                  np.zeros((32, n, 3), dtype=int),
    #                  np.random.randint(-299, 300, size=(32, n, 3))),
    # Reduced dimension:
    setup=lambda n: (np.random.randint(0, 1000, size=(98, 3, n, 3)),
                     np.zeros((32, 98, 3), dtype=int),
                     np.random.randint(-n+1, n, size=(32, 98, 3))),
    kernels=[
        lambda a: sliced_reduce(*a),
        lambda a: sliced_sum(*a),
        lambda a: sliced_sum_numba(*a),
        lambda a: reduce_cumulative(*a),
        lambda a: baseline(*a),
    ],
    labels=['sliced_reduce', 'sliced_sum', 'sliced_sum_numba', 'reduce_cumulative', 'baseline'],
    # n_range=[2 ** k for k in range(13)],  # Leading dimensions.
    # n_range=[2 ** k for k in range(11)],  # Common dimensions.
    n_range=[2 ** k for k in range(2, 13)],  # Reduced dimension.
    # xlabel='Size of leading dimension',
    # xlabel='Size of first common dimension (second is 3)',
    xlabel='Size of reduced dimension',
)
Answer 4 (score: 1)
Another idea, proposed by this answer (hence community wiki), is to use np.cumsum and then select the rows corresponding to the slice indices. Zero indices can be handled by inserting an additional row of zeros at the start of the axis to be reduced. The method performs unnecessary computations, since it computes the full cumulative sum beyond the final indices. If the stop indices are uniformly distributed along the axis (with a median of input_array.shape[axis]//2), on average this performs twice as many addition operations as required. Nevertheless, compared to the other approaches the method seems to perform quite well (at least for the dimensions indicated by the OP).
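To see the idea on the question's small arrays first, a minimal sketch (assuming b of shape (3, 10, 3) and v = np.array([8, 9, 4]) from the question):

import numpy as np

c = np.cumsum(np.insert(b, 0, 0, axis=1), axis=1)  # prepend a zero row along axis 1, then accumulate
print(c[np.arange(3), v])  # row v[i] of the cumulative sum equals np.sum(b[i, 0:v[i]], axis=0)

The general implementation: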
def reduce_cumulative(a, i, j, ufunc, axis=None):
    if axis is None:
        axis = len(i.shape)
    i = (a.shape[axis] + i) % a.shape[axis]
    j = (a.shape[axis] + j) % a.shape[axis]
    a = np.insert(a, 0, 0, axis)  # Insert zeros to account for zero indices.
    c = ufunc.accumulate(a, axis=axis)
    pre = np.ix_(*(range(x) for x in i.shape))  # Indices for dimensions prior to `axis`.
    l = len(i.shape) - axis  # Number of leading dimensions in `i` and `j`.
    return c[pre[l:] + (j,)] - c[pre[l:] + (i,)]
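For the examples in the OP it can be used analogously to the other answers (a sketch):

# 1. example:
b = np.random.randint(0, 1000, size=(3, 10, 3))
v = np.random.randint(-9, 10, size=3)
result = reduce_cumulative(b, np.zeros_like(v), v, np.add)
# 2. example:
b = np.random.randint(0, 1000, size=(98, 3, 300, 3))
v = np.random.randint(-299, 300, size=(32, 98, 3))
result = reduce_cumulative(b, np.zeros_like(v), v, np.add, axis=2)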
Answer 5 (score: 0)
Here is a one-liner. No guarantee that this is the most efficient version, since it does a lot of unnecessary additions:
In [25]: b.cumsum(axis=1)[np.arange(b.shape[0]), v-1]
Out[25]:
array([[3., 1., 4.],
[4., 2., 3.],
[3., 0., 1.]])
(Note also that it does not handle a 0 in v correctly.)
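One way to fix the zero handling is to prepend a row of zeros before accumulating, as in the previous answer (a sketch):

In [26]: c = np.insert(b, 0, 0, axis=1).cumsum(axis=1)

In [27]: c[np.arange(b.shape[0]), v]  # row v[i] == sum of the first v[i] rows; v == 0 gives zeros
Out[27]:
array([[3., 1., 4.],
       [4., 2., 3.],
       [3., 0., 1.]])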