Question

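I have a sorted integer array, e.g. [0, 0, 1, 1, 1, 2, 4, 4], and I would like to determine where the blocks of equal integers start and how long each block is. The block sizes are small, but the array itself can be very large, so efficiency is important. The total number of blocks is also known.

numpy.unique does the trick: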

import numpy


a = numpy.array([0, 0, 1, 1, 1, 2, 4, 4])
num_blocks = 4
print(a)

_, idx_start, count = numpy.unique(a, return_index=True, return_counts=True)

print(idx_start)
print(count)

[0 0 1 1 1 2 4 4]
[0 2 5 6]
[2 3 1 2]

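but it is slow. I assume that, given the specific structure of the input array, there is a more efficient solution.

For example, something as simple as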

import numpy

a = numpy.array([0, 0, 1, 1, 1, 2, 3, 3])
num_blocks = 4


k = 0
z = a[k]
block_idx = 0
counts = numpy.empty(num_blocks, dtype=int)
count = 0
while k < len(a):
    if z == a[k]:
        count += 1
    else:
        z = a[k]
        counts[block_idx] = count
        count = 1
        block_idx += 1
    k += 1
counts[block_idx] = count

print(counts)

gives the block sizes, and a simple numpy.cumsum would give index_start. Of course, using a Python loop is slow.

Any hints?
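To make the cumsum step concrete, here is a minimal sketch. It assumes `counts` holds the block sizes computed by the loop above; note that the plain cumulative sum gives the block ends, so the sizes are subtracted again to obtain the starts.

```python
import numpy

# Block sizes as produced by the loop above for [0, 0, 1, 1, 1, 2, 3, 3]
counts = numpy.array([2, 3, 1, 2])

# cumsum(counts) gives the end of each block (exclusive);
# subtracting the sizes shifts that to the start of each block
idx_start = numpy.cumsum(counts) - counts

print(idx_start)  # [0 2 5 6]
```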

Answer 1

Here's one with masking and slicing -

import numpy as np

def grp_start_len(a):
    # True wherever a new block begins, plus a trailing True marking the end
    m = np.r_[True, a[:-1] != a[1:], True]  # np.concatenate for a bit more boost
    idx = np.flatnonzero(m)
    return idx[:-1], np.diff(idx)

Sample run -

In [18]: a
Out[18]: array([0, 0, 1, 1, 1, 2, 4, 4])

In [19]: grp_start_len(a)
Out[19]: (array([0, 2, 5, 6]), array([2, 3, 1, 2]))
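To see why the mask works, the intermediate arrays can be spelled out step by step. This is only an illustrative sketch using the `np.concatenate` variant mentioned in the comment; the name `change` is made up here, and the code assumes a non-empty input (for an empty array the two padding values alone would report one block of length 1).

```python
import numpy as np

a = np.array([0, 0, 1, 1, 1, 2, 4, 4])

# True wherever a value differs from its predecessor, i.e. at block boundaries
change = a[:-1] != a[1:]          # [F T F F T T F]

# Pad with True at both ends: index 0 starts the first block,
# and index len(a) closes the last one
m = np.concatenate(([True], change, [True]))

idx = np.flatnonzero(m)           # [0 2 5 6 8]
starts, lengths = idx[:-1], np.diff(idx)

print(starts)   # [0 2 5 6]
print(lengths)  # [2 3 1 2]
```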

Timings (setup from @AGN Gazer's solution) -

In [24]: np.random.seed(0)

In [25]: a = np.sort(np.random.randint(1, 10000, 10000))

In [26]: %timeit _, idx_start, count = np.unique(a, return_index=True, return_counts=True)
1000 loops, best of 3: 411 µs per loop

# @AGN Gazer's solution
In [27]: %timeit st = np.where(np.ediff1d(a, a[-1] + 1, a[0] + 1))[0]; idx = st[:-1]; cnt = np.ediff1d(st)
10000 loops, best of 3: 81.2 µs per loop

In [28]: %timeit grp_start_len(a)
10000 loops, best of 3: 60.1 µs per loop

With a 10x larger array -

In [40]: np.random.seed(0)

In [41]: a = np.sort(np.random.randint(1, 100000, 100000))

In [42]: %timeit _, idx_start, count = np.unique(a, return_index=True, return_counts=True)
    ...: %timeit st = np.where(np.ediff1d(a, a[-1] + 1, a[0] + 1))[0]; idx = st[:-1]; cnt = np.ediff1d(st)
    ...: %timeit grp_start_len(a)
100 loops, best of 3: 5.34 ms per loop
1000 loops, best of 3: 792 µs per loop
1000 loops, best of 3: 463 µs per loop

Answer 2

np.where(np.ediff1d(a, None, a[0]))[0]

If you want the first "0" in your answer, add a non-zero number to a[0]:

np.where(np.ediff1d(a, None, a[0] + 1))[0]
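Written with keyword arguments, the call may be easier to read: the second positional parameter of `np.ediff1d` is `to_end` and the third is `to_begin`. The sketch below departs from the line above in one respect, which is an assumption on my part rather than the original code: it prepends the constant `1` instead of `a[0] + 1`, since `a[0] + 1` would itself be zero if the array happened to start at -1.

```python
import numpy as np

a = np.array([0, 0, 1, 1, 1, 2, 4, 4])

# Consecutive differences, with a non-zero value prepended so that
# index 0 is always reported as the start of the first block
d = np.ediff1d(a, to_begin=1)     # [1 0 1 0 0 1 2 0]

# Non-zero entries mark the first element of each block
starts = np.flatnonzero(d)

print(starts)  # [0 2 5 6]
```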

EDIT (block lengths):

Ah, just noticed that you also want the block lengths. Then, modify the above code:

st = np.where(np.ediff1d(a, a[-1] + 1, a[0] + 1))[0]
idx = st[:-1]
cnt = np.ediff1d(st)

Then,

>>> print(idx)
[0 2 5 6]
>>> print(cnt)
[2 3 1 2]

EDIT 2 (timing tests)

In [69]: a = np.sort(np.random.randint(1, 10000, 10000))

In [70]: %timeit _, idx_start, count = np.unique(a, return_index=True, return_counts=True)
240 µs ± 7.44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [71]: %timeit st = np.where(np.ediff1d(a, a[-1] + 1, a[0] + 1))[0]; idx = st[:-1]; cnt = np.ediff1d(st)
74.3 µs ± 816 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
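Outside IPython, the comparison can be reproduced with the standard `timeit` module. The following is a rough sketch rather than a rigorous benchmark: the wrapper names `with_unique`, `with_ediff1d` and `with_mask` are made up for illustration, and the absolute numbers will depend on the machine and the NumPy version. It also checks that all three approaches return the same starts and lengths.

```python
import timeit
import numpy as np

np.random.seed(0)
a = np.sort(np.random.randint(1, 10000, 10000))

def with_unique(a):
    _, idx_start, count = np.unique(a, return_index=True, return_counts=True)
    return idx_start, count

def with_ediff1d(a):
    # Values here are all >= 1, so both padding values are non-zero
    st = np.where(np.ediff1d(a, a[-1] + 1, a[0] + 1))[0]
    return st[:-1], np.ediff1d(st)

def with_mask(a):
    m = np.r_[True, a[:-1] != a[1:], True]
    idx = np.flatnonzero(m)
    return idx[:-1], np.diff(idx)

# All three approaches should agree on starts and lengths
ref_starts, ref_lengths = with_unique(a)
for f in (with_ediff1d, with_mask):
    starts, lengths = f(a)
    assert np.array_equal(starts, ref_starts)
    assert np.array_equal(lengths, ref_lengths)

# Best of 3 repeats, 100 calls each
for f in (with_unique, with_ediff1d, with_mask):
    t = min(timeit.repeat(lambda: f(a), number=100, repeat=3)) / 100
    print(f"{f.__name__}: {t * 1e6:.1f} µs per call")
```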
