Vectorized counting in a 2D array in Python

Time: 2017-09-26 22:52:40

Tags: python numpy vectorization

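I am trying to build a function that counts consecutive zeros down each column of a 2D array, resetting the count to 0 whenever a non-zero entry appears. For example,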

>>> a
array([[ 0,  0,  1,  0,  2],
       [ 0,  0,  0,  1,  1],
       [ 0,  1,  0,  0,  0],
       [ 0,  0, 10,  2,  2],
       [ 2,  0,  0,  0,  0]])

In this case, the output I want would be

array([[1, 1, 0, 1, 0],
       [2, 2, 1, 0, 0],
       [3, 0, 2, 1, 1],
       [4, 1, 0, 0, 0],
       [0, 2, 1, 1, 1]])

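I tried doing this with two for loops, but it is very slow on very large datasets. I was hoping to find a way to vectorize this operation so that the time is O(n) rather than O(n^2). Any help would be greatly appreciated!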
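The two-loop attempt is not shown in the question. A minimal sketch of what such a baseline might look like is given here for reference; the function name `count_zeros_naive` is hypothetical and the code is an assumption about the approach, not the asker's actual code.

```python
import numpy as np

def count_zeros_naive(a):
    """Running count of consecutive zeros down each column, reset by any non-zero."""
    out = np.zeros_like(a)
    n_rows, n_cols = a.shape
    for j in range(n_cols):          # one column at a time
        run = 0
        for i in range(n_rows):      # walk down the column
            run = run + 1 if a[i, j] == 0 else 0
            out[i, j] = run
    return out

a = np.array([[0, 0,  1, 0, 2],
              [0, 0,  0, 1, 1],
              [0, 1,  0, 0, 0],
              [0, 0, 10, 2, 2],
              [2, 0,  0, 0, 0]])
print(count_zeros_naive(a))
```

Every element is visited exactly once, so the work is proportional to the number of entries; the slowness comes from running each step in the Python interpreter.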

2 answers:

Answer 0 (score: 1)

Something like this might make it a bit faster:

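import numpy as np

a = np.array([[ 0,  0,  1,  0,  2],
              [ 0,  0,  0,  1,  1],
              [ 0,  1,  0,  0,  0],
              [ 0,  0, 10,  2,  2],
              [ 2,  0,  0,  0,  0]])

b = (a == 0)            # True where an entry is zero
c = np.zeros_like(a)
c[0, :] += b[0, :]      # first row: 1 where zero, else 0
# Iterate over rows (shape[0]); each step updates a whole row at once
for i in range(1, c.shape[0]):
    c[i, :] = b[i, :] * (1 + c[i-1, :])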

The array 'c' holds the desired result.

Or, optimizing further...

a = ...
b = (a == 0) * 1        # integer mask: 1 where zero, else 0
for i in range(1, b.shape[0]):   # loop over rows
    b[i, :] *= (1 + b[i-1, :])

Now 'b' is your result, and you have one fewer array to deal with.

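You'll notice that this algorithm still has the same time complexity as the double for loop solution, but now one of the loops is internalized by numpy, so I would expect a speedup on large arrays.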
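As a sketch, the row-by-row idea can be wrapped in a function. The name `count_zeros_rowwise` and the sample array `tall` are hypothetical. The loop bound is taken from the number of rows (`shape[0]`) so that non-square inputs are handled too.

```python
import numpy as np

def count_zeros_rowwise(a):
    """One Python-level loop over rows; each step updates a whole row at once."""
    counts = (a == 0).astype(np.intp)       # 1 where zero, else 0
    for i in range(1, counts.shape[0]):
        # Zero entries extend the run from the row above; non-zero entries stay 0
        counts[i, :] *= 1 + counts[i - 1, :]
    return counts

# Non-square input: 4 rows, 2 columns
tall = np.array([[0, 1],
                 [0, 0],
                 [3, 0],
                 [0, 0]])
print(count_zeros_rowwise(tall))
# [[1 0]
#  [2 1]
#  [0 2]
#  [1 3]]
```

The Python-level loop runs once per row rather than once per element, which is where the expected speedup comes from.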

Answer 1 (score: 0)

Something like this, which is vectorized and has no for loops:

import numpy as np

def moving_count(a, value, axis=0):
    """Return sequential counts of a given value along an axis"""
    if np.all(a == value):
        # Fill the output with counts along the desired axis
        return np.rollaxis(np.arange(1, a.size + 1).reshape(a.shape), axis)

    # Allocate output with a cumulative count of value along flattened axis
    output = np.cumsum(np.rollaxis(a, axis) == value)

    # Find locations of breakpoints
    breakpoints = (np.roll(output, 1) - output) == 0

    # Since breakpoints is boolean, argmax returns the location of the first breakpoint
    threshold = np.argmax(breakpoints)

    # Repeat the cumulative value along labels and subtract to reset the count at each breakpoint
    output[threshold:] -= output[breakpoints][np.cumsum(breakpoints) - 1][threshold:]

    # Reshape and return axis to match input array
    return np.rollaxis(output.reshape(a.shape), axis)

Applied to your problem:

In[3]: a = np.array([[ 0,  0,  1,  0,  2],
                     [ 0,  0,  0,  1,  1],
                     [ 0,  1,  0,  0,  0],
                     [ 0,  0, 10,  2,  2],
                     [ 2,  0,  0,  0,  0]])
In[4]: moving_count(a, 0, 1)
Out[4]: 

array([[1, 1, 0, 2, 0],
       [2, 2, 1, 0, 0],
       [3, 0, 2, 1, 1],
       [4, 1, 0, 0, 0],
       [0, 2, 1, 1, 1]])
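Note that Out[4] differs from the desired output at position [0, 3]: it shows 2 where the question asks for 1. The cumulative sum runs over the flattened array, so the run of zeros ending column 2 carries over into the top of column 3. One possible loop-free variant that resets at the top of every column is sketched below. It uses a different technique (tracking the row index of the most recent non-zero entry with `np.maximum.accumulate`), and the function name `count_zeros_vectorized` is hypothetical.

```python
import numpy as np

def count_zeros_vectorized(a):
    """Count consecutive zeros down each column without a Python-level loop."""
    rows = np.arange(a.shape[0])[:, None]        # column vector of row indices
    # Row index where the entry is non-zero, -1 where it is zero
    last_nonzero = np.where(a != 0, rows, -1)
    # Carry the most recent non-zero row index down each column
    last_nonzero = np.maximum.accumulate(last_nonzero, axis=0)
    # Distance from the most recent non-zero row (0 on non-zero entries)
    return rows - last_nonzero

a = np.array([[0, 0,  1, 0, 2],
              [0, 0,  0, 1, 1],
              [0, 1,  0, 0, 0],
              [0, 0, 10, 2, 2],
              [2, 0,  0, 0, 0]])
print(count_zeros_vectorized(a))
# [[1 1 0 1 0]
#  [2 2 1 0 0]
#  [3 0 2 1 1]
#  [4 1 0 0 0]
#  [0 2 1 1 1]]
```

Because the accumulation runs along axis 0 only, each column is handled independently and the input does not need to be square.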