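I'm trying to build a function that counts consecutive zeros in a list, resetting the count to 0 whenever a non-zero entry appears and then starting again. For example, given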
>>> a
array([[ 0,  0,  1,  0,  2],
       [ 0,  0,  0,  1,  1],
       [ 0,  1,  0,  0,  0],
       [ 0,  0, 10,  2,  2],
       [ 2,  0,  0,  0,  0]])
the output I want in this case (counting down each column) is
array([[1, 1, 0, 1, 0],
       [2, 2, 1, 0, 0],
       [3, 0, 2, 1, 1],
       [4, 1, 0, 0, 0],
       [0, 2, 1, 1, 1]])
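I tried doing this with two for loops, but it is very slow on very large datasets. I'm hoping to find a way to vectorize the operation so that the time is O(n) rather than O(n^2). Any help would be greatly appreciated!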
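For reference, a two-loop baseline of the kind described might look like the sketch below. The asker's actual code was not posted, so the function name `count_zeros_loops` and its structure are assumptions made purely for illustration.

```python
import numpy as np

def count_zeros_loops(a):
    """Hypothetical two-loop baseline: running count of zeros down each column."""
    a = np.asarray(a)
    n_rows, n_cols = a.shape
    out = np.zeros((n_rows, n_cols), dtype=np.int64)
    for j in range(n_cols):          # one column at a time
        run = 0
        for i in range(n_rows):      # walk down the column
            # Extend the run on a zero, reset it on anything else
            run = run + 1 if a[i, j] == 0 else 0
            out[i, j] = run
    return out

a = np.array([[0, 0,  1, 0, 2],
              [0, 0,  0, 1, 1],
              [0, 1,  0, 0, 0],
              [0, 0, 10, 2, 2],
              [2, 0,  0, 0, 0]])
print(count_zeros_loops(a)[:, 0].tolist())  # [1, 2, 3, 4, 0]
```

Every element is visited once, so the cost comes mostly from running both loops in the Python interpreter.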
Answer 0 (score: 1)
Something like this might make it a bit faster:
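import numpy as np

a = np.array([[ 0,  0,  1,  0,  2],
              [ 0,  0,  0,  1,  1],
              [ 0,  1,  0,  0,  0],
              [ 0,  0, 10,  2,  2],
              [ 2,  0,  0,  0,  0]])
b = (a == 0)             # True where the entry is zero
c = np.zeros_like(a)
c[0, :] += b[0, :]       # first row: 1 for a zero, 0 otherwise
# The loop walks over rows, so its bound is shape[0] (not shape[1])
for i in range(1, c.shape[0]):
    # Extend the run from the row above, or reset to 0 at a non-zero entry
    c[i, :] = b[i, :] * (1 + c[i-1, :])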
The array 'c' holds the desired result.
Or, optimizing a bit further...
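a = ...
b = (a == 0) * 1         # integer mask: 1 where the entry is zero
# Iterate over rows (shape[0]), updating the mask in place
for i in range(1, b.shape[0]):
    b[i, :] *= (1 + b[i-1, :])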
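Now 'b' is your result, and you have one fewer array to deal with.
You'll notice that this algorithm still has the same time complexity as the double for-loop solution, but one of the loops is now handled internally by NumPy, so I would expect a speedup on large arrays.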
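As a minimal sketch, the row-by-row recurrence can be wrapped in a function and checked against the output requested in the question. The name `count_zero_runs` is hypothetical, and the sketch assumes the count runs down axis 0 of a 2-D array.

```python
import numpy as np

def count_zero_runs(a):
    """Running count of consecutive zeros down each column (hypothetical helper)."""
    counts = (np.asarray(a) == 0).astype(np.int64)  # 1 where zero, else 0
    for i in range(1, counts.shape[0]):
        # A zero extends the run from the row above; a non-zero stays 0
        counts[i] *= 1 + counts[i - 1]
    return counts

a = np.array([[0, 0,  1, 0, 2],
              [0, 0,  0, 1, 1],
              [0, 1,  0, 0, 0],
              [0, 0, 10, 2, 2],
              [2, 0,  0, 0, 0]])
expected = np.array([[1, 1, 0, 1, 0],
                     [2, 2, 1, 0, 0],
                     [3, 0, 2, 1, 1],
                     [4, 1, 0, 0, 0],
                     [0, 2, 1, 1, 1]])
print(np.array_equal(count_zero_runs(a), expected))  # True
```

The Python-level loop runs once per row while each row update is a single NumPy operation, so the benefit should be largest when the array has many columns.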
Answer 1 (score: 0)
Something like this, which is vectorized and has no for loops:
import numpy as np

def moving_count(a, value, axis=0):
    """Return sequential counts of a given value along an axis"""
    if np.all(a == value):
        # Fill the output with counts along the desired axis
        return np.rollaxis(np.arange(1, a.size + 1).reshape(a.shape), axis)
    # Allocate output with a cumulative count of value along flattened axis
    output = np.cumsum(np.rollaxis(a, axis) == value)
    # Find locations of breakpoints
    breakpoints = (np.roll(output, 1) - output) == 0
    # Since breakpoints is boolean, argmax returns the location of the first breakpoint
    threshold = np.argmax(breakpoints)
    # Repeat the cumulative value along labels and subtract to reset the count at each breakpoint
    output[threshold:] -= output[breakpoints][np.cumsum(breakpoints) - 1][threshold:]
    # Reshape and return axis to match input array
    return np.rollaxis(output.reshape(a.shape), axis)
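Applied to your problem (note that Out[4] shows 2 in row 0, column 3, where the question expects 1: the cumulative sum is taken over the flattened array, so a run of zeros ending one column carries over into the start of the next):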
In[3]: a = np.array([[ 0,  0,  1,  0,  2],
                     [ 0,  0,  0,  1,  1],
                     [ 0,  1,  0,  0,  0],
                     [ 0,  0, 10,  2,  2],
                     [ 2,  0,  0,  0,  0]])
In[4]: moving_count(a, 0, 1)
Out[4]:
array([[1, 1, 0, 2, 0],
       [2, 2, 1, 0, 0],
       [3, 0, 2, 1, 1],
       [4, 1, 0, 0, 0],
       [0, 2, 1, 1, 1]])
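One loop-free way to keep each column independent is sketched below. This is a different technique from `moving_count` and was not proposed in the thread: it takes a running maximum (`np.maximum.accumulate`) over the positions of the non-zero entries and subtracts it from the current position. The name `zero_run_lengths` is hypothetical, and here `axis=0` means counting down the columns.

```python
import numpy as np

def zero_run_lengths(a, axis=0):
    """Count consecutive zeros along `axis`, resetting at each non-zero entry.

    Hypothetical helper, not from the original answers.
    """
    moved = np.moveaxis(np.asarray(a), axis, 0)   # counting axis first
    n = moved.shape[0]
    # 1-based position along the counting axis, shaped to broadcast
    pos = np.arange(1, n + 1).reshape((n,) + (1,) * (moved.ndim - 1))
    # Position of each non-zero entry; 0 where the entry is zero
    resets = np.where(moved == 0, 0, pos)
    # Position of the most recent non-zero entry (0 if none seen yet)
    last_reset = np.maximum.accumulate(resets, axis=0)
    # Distance since the last reset is the current run of zeros
    return np.moveaxis(pos - last_reset, 0, axis)

a = np.array([[0, 0,  1, 0, 2],
              [0, 0,  0, 1, 1],
              [0, 1,  0, 0, 0],
              [0, 0, 10, 2, 2],
              [2, 0,  0, 0, 0]])
print(zero_run_lengths(a)[0].tolist())  # [1, 1, 0, 1, 0]
```

Because the running maximum is taken separately along the counting axis, a run at the bottom of one column cannot leak into the top of the next, and the array is never flattened.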