Question

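I'm fairly new to numpy, so I apologize if this question has already been asked. I'm looking for a vectorized solution that can run multiple cumsums of different sizes over a one-dimensional numpy array.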

my_vector=np.array([1,2,3,4,5])
size_of_groups=np.array([3,2])

I would like something like this:

np.cumsum.group(my_vector,size_of_groups)
[1,3,6,4,9]

I don't want a solution that uses a loop, only numpy functions or numpy operations.

Answer 1

Not sure about numpy, but pandas can do this easily with groupby + cumsum:

import pandas as pd

s = pd.Series(my_vector)
s.groupby(s.index.isin(size_of_groups.cumsum()).cumsum()).cumsum()

0    1
1    3
2    6
3    4
4    9
dtype: int64
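
To see what the groupby key in that one-liner evaluates to, here is a small NumPy-only sketch using the question's sample data. The names boundaries, is_start, group_labels and labels_alt are just illustrative, and np.isin stands in for the pandas Index.isin call:

```python
import numpy as np

my_vector = np.array([1, 2, 3, 4, 5])
size_of_groups = np.array([3, 2])

# Cumulative sizes mark where each following group starts: [3, 5].
# The last value is one past the end, so it never matches a position.
boundaries = size_of_groups.cumsum()

# True at the first position of every group except the first one
is_start = np.isin(np.arange(len(my_vector)), boundaries)

# A running count of the starts gives one integer label per element
group_labels = is_start.cumsum()

# The same labels built directly from the sizes
labels_alt = np.repeat(np.arange(len(size_of_groups)), size_of_groups)

print(group_labels)  # [0 0 0 1 1]
```

Grouping by these labels and taking the cumsum within each label is what produces the restart at position 3.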

Answer 2

Here's a vectorized solution -

import numpy as np

def intervaled_cumsum(ar, sizes):
    # Make a copy to be used as output array
    out = ar.copy()

    # Get cumulative values of array
    arc = ar.cumsum()

    # Get cumsumed indices to be used to place differentiated values into
    # input array's copy
    idx = sizes.cumsum()

    # Place differentiated values that, when cumulatively summed later on,
    # would give us the desired intervaled cumsum. With a single group there
    # is no boundary to reset, and idx[0] would be out of bounds.
    if len(sizes) > 1:
        out[idx[0]] = ar[idx[0]] - arc[idx[0]-1]
        out[idx[1:-1]] = ar[idx[1:-1]] - np.diff(arc[idx[:-1]-1])
    return out.cumsum()
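
To make the "differentiated values" idea concrete, here is a step-by-step trace of the same operations on the question's two-group sample. This is only an illustration; the name seeded is introduced here and is not part of the function:

```python
import numpy as np

ar = np.array([1, 2, 3, 4, 5])
sizes = np.array([3, 2])

arc = ar.cumsum()      # running total over the whole array: [1, 3, 6, 10, 15]
idx = sizes.cumsum()   # [3, 5]; idx[0] is where the second group starts

out = ar.copy()
# Subtract everything accumulated before the second group from its first
# element: 4 - 6 = -2
out[idx[0]] = ar[idx[0]] - arc[idx[0] - 1]

seeded = out.copy()    # the array just before the final cumsum
result = out.cumsum()  # the -2 cancels the first group's total of 6

print(seeded)  # [ 1  2  3 -2  5]
print(result)  # [1 3 6 4 9]
```

The negative value planted at each group start cancels the total carried over from the previous groups, so a single cumsum over the whole array restarts at every boundary.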

Sample run -

In [114]: ar = np.array([1,2,3,4,5,6,7,8,9,10,11,12])
     ...: sizes = np.array([3,2,2,3,2])

In [115]: intervaled_cumsum(ar, sizes)
Out[115]: array([ 1,  3,  6,  4,  9,  6, 13,  8, 17, 27, 11, 23])
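
For comparison, the same result can also be reached by subtracting a per-group offset from one global cumsum. This is a different formulation from the function above, sketched under the assumption that every size is positive and that the sizes sum to len(ar); the name grouped_cumsum_repeat is made up for this example:

```python
import numpy as np

def grouped_cumsum_repeat(ar, sizes):
    # Running total over the whole array
    arc = ar.cumsum()
    # One-past-the-end index of every group
    ends = sizes.cumsum()
    # Total accumulated before each group starts (0 for the first group)
    offsets = np.concatenate(([0], arc[ends[:-1] - 1]))
    # Spread each offset over its group and remove it from the running total
    return arc - np.repeat(offsets, sizes)

ar = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
sizes = np.array([3, 2, 2, 3, 2])
res = grouped_cumsum_repeat(ar, sizes)
print(res)
```

On this input it gives the same values as the sample run above. It has not been timed against the other approaches here.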

Benchmarking

Other approach -

# @cᴏʟᴅsᴘᴇᴇᴅ's solution
import pandas as pd
def pandas_soln(my_vector, sizes):
    s = pd.Series(my_vector)
    return s.groupby(s.index.isin(sizes.cumsum()).cumsum()).cumsum().values

The given sample used two intervals of lengths 2 and 3. Keep those lengths and simply use many more groups for timing purposes.

Timings -

In [146]: N = 10000 # number of groups
     ...: np.random.seed(0)
     ...: sizes = np.random.randint(2,4,(N))
     ...: ar = np.random.randint(0,N,sizes.sum())

In [147]: %timeit intervaled_cumsum(ar, sizes)
     ...: %timeit pandas_soln(ar, sizes)
10000 loops, best of 3: 178 µs per loop
1000 loops, best of 3: 1.82 ms per loop

In [148]: N = 100000 # number of groups
     ...: np.random.seed(0)
     ...: sizes = np.random.randint(2,4,(N))
     ...: ar = np.random.randint(0,N,sizes.sum())

In [149]: %timeit intervaled_cumsum(ar, sizes)
     ...: %timeit pandas_soln(ar, sizes)
100 loops, best of 3: 3.91 ms per loop
100 loops, best of 3: 17.3 ms per loop

In [150]: N = 1000000 # number of groups
     ...: np.random.seed(0)
     ...: sizes = np.random.randint(2,4,(N))
     ...: ar = np.random.randint(0,N,sizes.sum())

In [151]: %timeit intervaled_cumsum(ar, sizes)
     ...: %timeit pandas_soln(ar, sizes)
10 loops, best of 3: 31.6 ms per loop
1 loop, best of 3: 357 ms per loop
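
Outside IPython, where %timeit is not available, similar measurements can be taken with the standard timeit module. The helper best_time below is a hypothetical name for this sketch, and np.cumsum is only a stand-in callable; pass intervaled_cumsum or pandas_soln (with ar, sizes as arguments) to time the actual solutions:

```python
import timeit
import numpy as np

def best_time(fn, *args, repeat=3, number=10):
    """Return the best per-call time in seconds over `repeat` runs."""
    timer = timeit.Timer(lambda: fn(*args))
    return min(timer.repeat(repeat=repeat, number=number)) / number

# Same data setup as the first timing above
N = 10000  # number of groups
np.random.seed(0)
sizes = np.random.randint(2, 4, N)         # group sizes of 2 or 3
ar = np.random.randint(0, N, sizes.sum())

# np.cumsum is a placeholder here, not one of the benchmarked solutions
t = best_time(np.cumsum, ar)
print("%.1f us per call" % (t * 1e6))
```

Absolute numbers will differ by machine and library version, so only the relative ordering of the approaches is meaningful.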

Answer 3

Here is an unconventional solution. It is not very fast, though (even a bit slower than pandas).

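>>> import numpy as np
>>> from scipy import linalg
>>> 
>>> N = len(my_vector)
>>> # Banded form of a lower bidiagonal matrix: row 0 is the diagonal (1),
>>> # row 1 is the subdiagonal (-1)
>>> D = np.repeat((*zip((1,-1)),), N, axis=1)
>>> # Zero the subdiagonal at the end of each group so the sum restarts
>>> D[1, np.cumsum(size_of_groups) - 1] = 0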
>>> 
>>> linalg.solve_banded((1, 0), D, my_vector)
array([1., 3., 6., 4., 9.])
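
The idea behind the banded solve may be easier to see with the dense matrix written out. The sketch below builds what should be the equivalent dense system with plain NumPy and solves it with np.linalg.solve instead of solve_banded; the names sub, A and x are illustrative, and this is only practical for small N:

```python
import numpy as np

my_vector = np.array([1, 2, 3, 4, 5])
size_of_groups = np.array([3, 2])
N = len(my_vector)

# Within a group:              x[i] - x[i-1] = b[i]
# At the start of a new group: x[i]          = b[i]
sub = -np.ones(N - 1)
# Cut the link between the last element of a group and the next element
sub[np.cumsum(size_of_groups)[:-1] - 1] = 0
A = np.eye(N) + np.diag(sub, k=-1)

x = np.linalg.solve(A, my_vector)
print(x)  # [1. 3. 6. 4. 9.]
```

Each row of A states that an element of the result minus its predecessor equals the input value, which is a cumulative sum written as a linear system. Zeroing a subdiagonal entry drops the predecessor and restarts the sum.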
