Question

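I have an array of integer labels and I would like to determine how many of each label is present, and store those values in an array of the same size as the input. This can be accomplished with the following loop: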

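import numpy

def counter(labels):
    # Output array with the same shape as the input
    sizes = numpy.zeros(labels.shape)
    for num in numpy.unique(labels):
        mask = labels == num
        # Every position holding this label gets the label's total count
        sizes[mask] = numpy.count_nonzero(mask)
    return sizes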

Input:

array = numpy.array([
       [0, 1, 2, 3],
       [0, 1, 1, 3],
       [3, 1, 3, 1]])

counter() returns:

array([[ 2.,  5.,  1.,  4.],
       [ 2.,  5.,  5.,  4.],
       [ 4.,  5.,  4.,  5.]])

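However, for large arrays with many unique labels, 60,000 in my case, this takes a considerable amount of time. This is the first step in a complex algorithm, and I can't afford to spend more than about 30 seconds on it. Is there an existing function that can accomplish this? If not, how can I speed up the existing loop?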

Answer 1

Approach #1

Here's one using np.unique -

# tags gives, for each element, the index of its label among the sorted unique labels
_, tags, count = np.unique(labels, return_counts=True, return_inverse=True)
sizes = count[tags]
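
As a quick check on the sample array from the question, here is a minimal sketch of Approach #1. The final reshape is my own addition, on the assumption that some NumPy versions return the inverse indices flattened for multi-dimensional input; it is harmless when they already have the input's shape.

```python
import numpy as np

labels = np.array([
    [0, 1, 2, 3],
    [0, 1, 1, 3],
    [3, 1, 3, 1]])

# count[i] is how often the i-th sorted unique label occurs;
# tags holds, for every element, the index of its label in that sorted list.
_, tags, count = np.unique(labels, return_inverse=True, return_counts=True)

# Reshape guards against tags coming back flattened for 2D input (assumption).
sizes = count[tags].reshape(labels.shape)
print(sizes)
```

Note that the result has an integer dtype, whereas the loop in the question fills a float array.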

Approach #2

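With non-negative integers in labels, it's simpler and more efficient with np.bincount -

# np.bincount needs 1D input, so flatten first; indexing with labels keeps the input shape
sizes = np.bincount(labels.ravel())[labels]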
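
Applied to the 2D sample array from the question, a minimal sketch of Approach #2 could look like this. Flattening with ravel() is my addition, since np.bincount only accepts 1D input.

```python
import numpy as np

labels = np.array([
    [0, 1, 2, 3],
    [0, 1, 1, 3],
    [3, 1, 3, 1]])

# counts[k] is the number of occurrences of label k.
counts = np.bincount(labels.ravel())

# Fancy indexing with the original array yields a result of the same shape.
sizes = counts[labels]
print(sizes)
```

One thing to keep in mind: np.bincount allocates an array of length labels.max() + 1 and rejects negative values, so for sparse or very large label values Approach #1 is probably the safer choice.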

Runtime test

The setup uses up to 60,000 unique non-negative labels, with two datasets of lengths 100,000 and 1,000,000.

Set #1:

In [192]: np.random.seed(0)
     ...: labels = np.random.randint(0,60000,(100000))

In [193]: %%timeit
     ...: sizes = np.zeros(labels.shape)
     ...: for num in np.unique(labels):
     ...:     mask = labels == num
     ...:     sizes[mask] = np.count_nonzero(mask)
1 loop, best of 3: 2.32 s per loop

In [194]: %timeit np.bincount(labels)[labels]
1000 loops, best of 3: 376 µs per loop

In [195]: 2320/0.376 # Speedup figure
Out[195]: 6170.212765957447

Set #2:

In [196]: np.random.seed(0)
     ...: labels = np.random.randint(0,60000,(1000000))

In [197]: %%timeit
     ...: sizes = np.zeros(labels.shape)
     ...: for num in np.unique(labels):
     ...:     mask = labels == num
     ...:     sizes[mask] = np.count_nonzero(mask)
1 loop, best of 3: 43.6 s per loop

In [198]: %timeit np.bincount(labels)[labels]
100 loops, best of 3: 5.15 ms per loop

In [199]: 43600/5.15 # Speedup figure
Out[199]: 8466.019417475727
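
Before trusting the speedup, it is worth confirming that the fast version gives the same values as the loop. Below is a rough sketch of such a check on a smaller random input; the helper names counter_loop and counter_bincount are hypothetical, chosen just for this illustration, and the sizes are scaled down so it runs quickly.

```python
import numpy as np

def counter_loop(labels):
    # Reference implementation: the loop from the question.
    sizes = np.zeros(labels.shape)
    for num in np.unique(labels):
        mask = labels == num
        sizes[mask] = np.count_nonzero(mask)
    return sizes

def counter_bincount(labels):
    # Approach #2; ravel() lets it accept inputs of any dimension.
    return np.bincount(labels.ravel())[labels]

rng = np.random.RandomState(0)
labels = rng.randint(0, 600, 10000)  # smaller than the benchmark above

# array_equal compares values, so the float/int dtype difference does not matter.
same = np.array_equal(counter_loop(labels), counter_bincount(labels))
print(same)
```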
