Question

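I need to iterate over a numpy array and perform the search shown below. It currently takes about 60 s for arrays (npArray1 and npArray2 in the example below) of roughly 300K values.

In other words, for each value in npArray1 I am looking for the index of its first occurrence in npArray2.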

for id in np.nditer(npArray1):
    newId = (np.where(npArray2 == id))[0][0]

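Is there any way to make this faster using numpy? I need to run the script above on much larger arrays (50M). Note that the two numpy arrays in the lines above, npArray1 and npArray2, are not necessarily the same size, but both are 1-D.

Thanks a lot for your help,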

Answer 1

The function np.unique will do most of the work for you:

npArray2 = np.random.randint(100, size=1000)  # 1000-long vector of ints in [0, 100), so lots of repeats
vals, idxs = np.unique(npArray2, return_index=True)  # each unique value AND the index of its first appearance
for val in npArray1:
    newId = idxs[vals == val][0]

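vals is an array containing the unique values of npArray2, and idxs gives the index of the first occurrence of each of those values in npArray2. Searching in vals should be much faster than searching in npArray2, since vals is smaller whenever npArray2 contains repeats.

You can speed up the search further by exploiting the fact that vals is sorted: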
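To make the two return values concrete, here is a tiny worked example. The input array is made up for illustration and is not from the original question.

```python
import numpy as np

# Illustrative input (an assumption): 5 and 3 each appear twice.
npArray2 = np.array([5, 3, 5, 7, 3])

# vals holds the sorted unique values; idxs[k] is the position in
# npArray2 where vals[k] first appears.
vals, idxs = np.unique(npArray2, return_index=True)

print(vals)  # [3 5 7]
print(idxs)  # [1 0 3]
```

Note that npArray2[idxs] reproduces vals, which is a quick sanity check on the result.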

import bisect  #we can use binary search since vals is sorted
for val in npArray1:
    newId = idxs[bisect.bisect_left(vals, val)]
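The Python-level loop can probably be removed as well, since np.searchsorted performs the same binary search for a whole array of queries at once. The following is only a sketch under stated assumptions: first_indices is a hypothetical helper name, npArray2 is assumed to be non-empty, and the -1 marker for values missing from npArray2 is a convention chosen here, not part of the original answer.

```python
import numpy as np

def first_indices(npArray1, npArray2):
    """Hypothetical helper: index of the first occurrence in npArray2 of
    each value of npArray1, or -1 where the value is absent.
    Assumes npArray2 is non-empty."""
    vals, idxs = np.unique(npArray2, return_index=True)
    # Binary search of every query value in the sorted unique values.
    pos = np.searchsorted(vals, npArray1)
    # Queries larger than every value land at vals.size; clip them so
    # the lookup below stays in bounds.
    pos = np.minimum(pos, vals.size - 1)
    found = vals[pos] == npArray1
    return np.where(found, idxs[pos], -1)

npArray2 = np.array([8, 33, 12, 19, 77, 19, 8])
npArray1 = np.array([77, 19, 8, 100])
result = first_indices(npArray1, npArray2)
print(result)  # [ 4  3  0 -1]
```

Unlike the bisect loop, this version also reports values that do not occur in npArray2 instead of silently returning a neighbouring index.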

Answer 2

Assuming the input arrays contain unique values, you can use np.searchsorted with its optional sorter argument for a vectorized solution, like so -

arr2_sortidx = npArray2.argsort()
idx = np.searchsorted(npArray2,npArray1,sorter=arr2_sortidx)
out1 = arr2_sortidx[idx]
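If npArray2 may contain repeated values, one possible adjustment is to request a stable sort, so that equal values keep their original relative order and the left-most match in sorted order is also the first occurrence in npArray2. This is a sketch rather than part of the answer above: first_occurrence is a hypothetical name, and every value of npArray1 is still assumed to be present in npArray2.

```python
import numpy as np

def first_occurrence(npArray1, npArray2):
    """Hypothetical variant that tolerates duplicates in npArray2.
    Assumes every value of npArray1 occurs in npArray2."""
    # A stable sort keeps equal values in their original order.
    order = np.argsort(npArray2, kind="stable")
    # side="left" picks the first of any run of equal values.
    idx = np.searchsorted(npArray2, npArray1, side="left", sorter=order)
    return order[idx]

npArray2 = np.array([8, 33, 19, 77, 19, 8])
npArray1 = np.array([19, 8, 77])
result = first_occurrence(npArray1, npArray2)
print(result)  # [2 0 3]
```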

Sample run to verify the output -

In [154]: npArray1
Out[154]: array([77, 19,  0, 69])

In [155]: npArray2
Out[155]: array([ 8, 33, 12, 19, 77, 30, 81, 69, 20,  0])

In [156]: out = np.empty(npArray1.size,dtype=int)
     ...: for i,id in np.ndenumerate(npArray1):
     ...:     out[i] = (np.where(npArray2==id))[0][0]
     ...:     

In [157]: arr2_sortidx = npArray2.argsort()
     ...: idx = np.searchsorted(npArray2,npArray1,sorter=arr2_sortidx)
     ...: out1 = arr2_sortidx[idx]
     ...: 

In [158]: out
Out[158]: array([4, 3, 9, 7])

In [159]: out1
Out[159]: array([4, 3, 9, 7])

Runtime test -

In [175]: def original_app(npArray1,npArray2):
     ...:     out = np.empty(npArray1.size,dtype=int)
     ...:     for i,id in np.ndenumerate(npArray1):
     ...:         out[i] = (np.where(npArray2==id))[0][0] 
     ...:     return out
     ...: 
     ...: def searchsorted_app(npArray1,npArray2):
     ...:   arr2_sortidx = npArray2.argsort()
     ...:   idx = np.searchsorted(npArray2,npArray1,sorter=arr2_sortidx)
     ...:   return arr2_sortidx[idx]
     ...: 

In [176]: # Setup inputs
     ...: M,N = 50000,40000 # npArray2 and npArray1 sizes respectively
     ...: maxn = 200000
     ...: npArray2 = np.unique(np.random.randint(0,maxn,(M)))
     ...: npArray2 = npArray2[np.random.permutation(npArray2.size)]
     ...: npArray1 = npArray2[np.random.permutation(npArray2.size)[:N]]
     ...: 

In [177]: out1 = original_app(npArray1,npArray2)

In [178]: out2 = searchsorted_app(npArray1,npArray2)

In [179]: np.allclose(out1,out2)
Out[179]: True

In [180]: %timeit original_app(npArray1,npArray2)
1 loops, best of 3: 3.14 s per loop

In [181]: %timeit searchsorted_app(npArray1,npArray2)
100 loops, best of 3: 17.4 ms per loop

Answer 3

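In the task you describe, you have to iterate over the array one way or another. You can therefore look for a considerable performance gain without changing the algorithm. This is where numba might help a lot: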

import numpy as np
from numba import jit

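@jit
def numba_iter(npa1, npa2):
    # Collect the results so the function returns something useful
    out = np.empty(npa1.size, dtype=np.int64)
    for i in range(npa1.size):
        out[i] = np.where(npa2 == npa1[i])[0][0]
    return out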

This simple approach might make your program considerably faster. See some examples and benchmarks here.

Looping and searching in a NumPy array
