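I have a sorted integer array, e.g., [0, 0, 1, 1, 1, 2, 4, 4], and I would like to determine where the integer blocks start and how long they are. The blocks are small, but the array itself can be very large, so efficiency is important. The total number of blocks is also known.
numpy.unique does the trick: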
import numpy
a = numpy.array([0, 0, 1, 1, 1, 2, 4, 4])
num_blocks = 4
print(a)
_, idx_start, count = numpy.unique(a, return_index=True, return_counts=True)
print(idx_start)
print(count)
[0 0 1 1 1 2 4 4]
[0 2 5 6]
[2 3 1 2]
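but it is slow. I would assume that, given the specific structure of the input array, there is a more efficient solution.
For example, something as simple as
import numpy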
a = numpy.array([0, 0, 1, 1, 1, 2, 3, 3])
num_blocks = 4
k = 0
z = a[k]
block_idx = 0
counts = numpy.empty(num_blocks, dtype=int)
count = 0
while k < len(a):
    if z == a[k]:
        count += 1
    else:
        z = a[k]
        counts[block_idx] = count
        count = 1
        block_idx += 1
    k += 1
counts[block_idx] = count
print(counts)
gives the block sizes, and a simple numpy.cumsum would give index_start. Of course, this is slow because of the Python loop.
Any hints?
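For reference, the start indices follow from the block sizes as an exclusive cumulative sum, that is, the running total taken before each block rather than after it. A minimal sketch, assuming the block sizes printed for the first example array; `idx_start` is an illustrative name:

```python
import numpy

# Block sizes for [0, 0, 1, 1, 1, 2, 4, 4]
counts = numpy.array([2, 3, 1, 2])

# cumsum gives the (exclusive) end of each block; subtracting each
# block's own size turns that into the block's start index.
idx_start = numpy.cumsum(counts) - counts
print(idx_start)  # [0 2 5 6]
```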
Answer 0 (score: 4)
Here's one with masking and slicing -
import numpy as np

def grp_start_len(a):
    # Mark each block's first element, plus a sentinel one past the end
    m = np.r_[True, a[:-1] != a[1:], True]  # np.concatenate for a bit more boost
    idx = np.flatnonzero(m)
    return idx[:-1], np.diff(idx)
Sample run -
In [18]: a
Out[18]: array([0, 0, 1, 1, 1, 2, 4, 4])
In [19]: grp_start_len(a)
Out[19]: (array([0, 2, 5, 6]), array([2, 3, 1, 2]))
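The inline comment mentions np.concatenate as a slightly faster alternative to np.r_. A sketch of what that variant might look like, with the intermediate mask spelled out; the name `grp_start_len_concat` is hypothetical, and the function assumes a non-empty, sorted 1-D array:

```python
import numpy as np

def grp_start_len_concat(a):
    # True wherever an element differs from its successor
    change = a[:-1] != a[1:]
    # Leading True marks the first block start; trailing True is a
    # sentinel at position len(a) so the last block gets a length
    m = np.concatenate(([True], change, [True]))
    idx = np.flatnonzero(m)  # block starts followed by len(a)
    return idx[:-1], np.diff(idx)

a = np.array([0, 0, 1, 1, 1, 2, 4, 4])
starts, lengths = grp_start_len_concat(a)
print(starts)   # [0 2 5 6]
print(lengths)  # [2 3 1 2]
```

For an empty input the mask would still contain the two sentinels, so callers may want to handle that case separately.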
Timings (setup from @AGN Gazer's solution) -
In [24]: np.random.seed(0)
In [25]: a = np.sort(np.random.randint(1, 10000, 10000))
In [26]: %timeit _, idx_start, count = np.unique(a, return_index=True, return_counts=True)
1000 loops, best of 3: 411 µs per loop
# @AGN Gazer's solution
In [27]: %timeit st = np.where(np.ediff1d(a, a[-1] + 1, a[0] + 1))[0]; idx = st[:-1]; cnt = np.ediff1d(st)
10000 loops, best of 3: 81.2 µs per loop
In [28]: %timeit grp_start_len(a)
10000 loops, best of 3: 60.1 µs per loop
With a 10x larger size -
In [40]: np.random.seed(0)
In [41]: a = np.sort(np.random.randint(1, 100000, 100000))
In [42]: %timeit _, idx_start, count = np.unique(a, return_index=True, return_counts=True)
...: %timeit st = np.where(np.ediff1d(a, a[-1] + 1, a[0] + 1))[0]; idx = st[:-1]; cnt = np.ediff1d(st)
...: %timeit grp_start_len(a)
100 loops, best of 3: 5.34 ms per loop
1000 loops, best of 3: 792 µs per loop
1000 loops, best of 3: 463 µs per loop
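The timings compare speed only, so a quick check that the three approaches return the same result on the benchmark input may be worthwhile. A sketch reusing the setup above; the variable names `idx_ref`, `cnt_ref`, and `st` are illustrative:

```python
import numpy as np

def grp_start_len(a):
    m = np.r_[True, a[:-1] != a[1:], True]
    idx = np.flatnonzero(m)
    return idx[:-1], np.diff(idx)

np.random.seed(0)
a = np.sort(np.random.randint(1, 10000, 10000))

# Reference result from np.unique
_, idx_ref, cnt_ref = np.unique(a, return_index=True, return_counts=True)

# Masking and slicing
idx, cnt = grp_start_len(a)

# ediff1d-based approach
st = np.where(np.ediff1d(a, a[-1] + 1, a[0] + 1))[0]

assert np.array_equal(idx, idx_ref) and np.array_equal(cnt, cnt_ref)
assert np.array_equal(st[:-1], idx_ref) and np.array_equal(np.ediff1d(st), cnt_ref)
print("all three approaches agree")
```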
Answer 1 (score: 3)
np.where(np.ediff1d(a, None, a[0]))[0]
If you want the first "0" to be included in the answer, add a non-zero number to a[0]:
np.where(np.ediff1d(a, None, a[0] + 1))[0]
Ah, I just noticed that you also want the block lengths. In that case, modify the code above:
st = np.where(np.ediff1d(a, a[-1] + 1, a[0] + 1))[0]
idx = st[:-1]
cnt = np.ediff1d(st)
Then,
>>> print(idx)
[0 2 5 6]
>>> print(cnt)
[2 3 1 2]
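One caveat: the sentinels a[0] + 1 and a[-1] + 1 only mark a boundary when they are non-zero, so an array that starts or ends with -1 would lose a block. A sketch of a variant that passes the constant 1 for both sentinels instead; the function name `block_starts_and_lengths` is hypothetical, and an integer array is assumed:

```python
import numpy as np

def block_starts_and_lengths(a):
    # Constant non-zero sentinels at both ends, independent of a's values
    st = np.where(np.ediff1d(a, to_end=1, to_begin=1))[0]
    return st[:-1], np.ediff1d(st)

a = np.array([-1, -1, 0, 0])

# Sentinel derived from a[0]: a[0] + 1 == 0, so the first block is missed
st = np.where(np.ediff1d(a, a[-1] + 1, a[0] + 1))[0]
print(st[:-1], np.ediff1d(st))  # [2] [2]

# Constant sentinel: both blocks are found
idx, cnt = block_starts_and_lengths(a)
print(idx, cnt)  # [0 2] [2 2]
```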
In [69]: a = np.sort(np.random.randint(1, 10000, 10000))
In [70]: %timeit _, idx_start, count = np.unique(a, return_index=True, return_counts=True)
240 µs ± 7.44 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [71]: %timeit st = np.where(np.ediff1d(a, a[-1] + 1, a[0] + 1))[0]; idx = st[:-1]; cnt = np.ediff1d(st)
74.3 µs ± 816 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)