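Looping and searching in NumPy arrays

Asked: 2016-03-11 11:07:19

Tags: python arrays performance numpy

I need to iterate over a numpy array and perform the search below. It takes about 60 s on arrays (npArray1 and npArray2 in the example below) of roughly 300K values.

In other words, for each value of npArray1 I am looking for the index of its first occurrence in npArray2.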

for id in np.nditer(npArray1):                    
       newId=(np.where(npArray2==id))[0][0] 

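Is there any way to do this faster with numpy? I need to run the script above on much larger arrays (50M). Note that the two numpy arrays in the two lines above, npArray1 and npArray2, are not necessarily the same size, but both are 1-D.

Many thanks for your help,

3 answers:

Answer 0 (score: 1)

The function np.unique will do most of the work for you: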

import numpy as np

npArray2 = np.random.randint(0, 100, (1000,))  # 1000-long vector of ints from 0 to 99, so lots of repeats
vals, idxs = np.unique(npArray2, return_index=True)  # each unique value AND the index of its first appearance
for val in npArray1:
    newId = idxs[vals == val][0]

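vals is an array holding the unique values of npArray2, and idxs holds the index of the first occurrence of each of those values in npArray2. Searching in vals should be much faster than searching in npArray2, because vals is smaller whenever npArray2 contains repeats.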
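To make the two return values concrete, here is a minimal sketch on a tiny made-up array (the sample values are an assumption chosen for illustration, not data from the question):

```python
import numpy as np

# Small illustrative array with repeated values (hypothetical data).
npArray2 = np.array([30, 10, 30, 20, 10])

# vals: sorted unique values; idxs: index of each value's first occurrence.
vals, idxs = np.unique(npArray2, return_index=True)

print(vals)  # [10 20 30]
print(idxs)  # [1 3 0]
```

Note that vals comes back sorted, which is what the binary-search step relies on.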

You can speed up the search further by exploiting the fact that vals is sorted:

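import bisect  # binary search is valid because np.unique returns vals sorted
for val in npArray1:
    # bisect_left gives the position of val in vals; this assumes val occurs in npArray2
    newId = idxs[bisect.bisect_left(vals, val)]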
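The per-element bisect loop can probably be replaced by a single vectorized call. The following is a sketch only, not part of the original answer: the function name first_occurrence and the choice of -1 as a "not found" marker are assumptions, and it expects npArray2 to be non-empty.

```python
import numpy as np

def first_occurrence(npArray1, npArray2):
    """For each value of npArray1, index of its first occurrence in npArray2 (-1 if absent)."""
    vals, idxs = np.unique(npArray2, return_index=True)
    # Binary search for all query values at once; vals is sorted.
    pos = np.searchsorted(vals, npArray1)
    # Values larger than every element of vals map past the end; clip them back in range.
    pos = np.clip(pos, 0, vals.size - 1)
    # A query is a real hit only if the value at the found position matches.
    found = vals[pos] == npArray1
    return np.where(found, idxs[pos], -1)

# Hypothetical sample data, including repeats in npArray2 and a missing value (5).
npArray2 = np.array([8, 33, 12, 19, 77, 19, 8, 0])
npArray1 = np.array([77, 19, 0, 8, 5])
result = first_occurrence(npArray1, npArray2)
print(result)  # [ 4  3  7  0 -1]
```

The explicit match check matters because a binary search returns an insertion point even when the value is not present.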

Answer 1 (score: 1)

Assuming the input arrays contain unique values, you can use np.searchsorted with its optional sorter argument for a vectorized solution, like this:

arr2_sortidx = npArray2.argsort()
idx = np.searchsorted(npArray2,npArray1,sorter=arr2_sortidx)
out1 = arr2_sortidx[idx]

Sample run to verify the output:

In [154]: npArray1
Out[154]: array([77, 19,  0, 69])

In [155]: npArray2
Out[155]: array([ 8, 33, 12, 19, 77, 30, 81, 69, 20,  0])

In [156]: out = np.empty(npArray1.size,dtype=int)
     ...: for i,id in np.ndenumerate(npArray1):
     ...:     out[i] = (np.where(npArray2==id))[0][0]
     ...:     

In [157]: arr2_sortidx = npArray2.argsort()
     ...: idx = np.searchsorted(npArray2,npArray1,sorter=arr2_sortidx)
     ...: out1 = arr2_sortidx[idx]
     ...: 

In [158]: out
Out[158]: array([4, 3, 9, 7])

In [159]: out1
Out[159]: array([4, 3, 9, 7])

Runtime test:

In [175]: def original_app(npArray1,npArray2):
     ...:     out = np.empty(npArray1.size,dtype=int)
     ...:     for i,id in np.ndenumerate(npArray1):
     ...:         out[i] = (np.where(npArray2==id))[0][0] 
     ...:     return out
     ...: 
     ...: def searchsorted_app(npArray1,npArray2):
     ...:   arr2_sortidx = npArray2.argsort()
     ...:   idx = np.searchsorted(npArray2,npArray1,sorter=arr2_sortidx)
     ...:   return arr2_sortidx[idx]
     ...: 

In [176]: # Setup inputs
     ...: M,N = 50000,40000 # npArray2 and npArray1 sizes respectively
     ...: maxn = 200000
     ...: npArray2 = np.unique(np.random.randint(0,maxn,(M)))
     ...: npArray2 = npArray2[np.random.permutation(npArray2.size)]
     ...: npArray1 = npArray2[np.random.permutation(npArray2.size)[:N]]
     ...: 

In [177]: out1 = original_app(npArray1,npArray2)

In [178]: out2 = searchsorted_app(npArray1,npArray2)

In [179]: np.allclose(out1,out2)
Out[179]: True

In [180]: %timeit original_app(npArray1,npArray2)
1 loops, best of 3: 3.14 s per loop

In [181]: %timeit searchsorted_app(npArray1,npArray2)
100 loops, best of 3: 17.4 ms per loop
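The searchsorted approach above is stated for unique values. If npArray2 may contain repeats, one possible adaptation is to request a stable argsort, so that among equal values the smallest original index comes first. This is a sketch under assumptions: the function name first_index_stable is made up here, and every value of npArray1 is assumed to occur in npArray2.

```python
import numpy as np

def first_index_stable(npArray1, npArray2):
    """First-occurrence indices via searchsorted; tolerates repeats in npArray2.

    Assumes every value of npArray1 is present in npArray2.
    """
    # A stable sort keeps equal values in their original order,
    # so the leftmost match in sorted order is the earliest occurrence.
    order = np.argsort(npArray2, kind="stable")
    pos = np.searchsorted(npArray2, npArray1, side="left", sorter=order)
    return order[pos]

# Hypothetical sample data with repeated values.
npArray2 = np.array([19, 8, 19, 0, 8])
npArray1 = np.array([8, 19, 0])
out = first_index_stable(npArray1, npArray2)
print(out)  # [1 0 3]
```

With side="left" the search lands on the first of any run of equal values, which the stable ordering maps back to the earliest position in npArray2.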

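Answer 2 (score: 0)

In the task you describe, you have to iterate over the array one way or another, so you can aim for a considerable performance boost without changing the algorithm. This is where numba might help a lot: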

import numpy as np
from numba import jit

@jit
def numba_iter(npa1, npa2):
    for id in np.nditer(npa1):                    
        newId=(np.where(npa2==id))[0][0]

This simple approach could make your program considerably faster. See some examples and benchmarks here
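As a rough illustration of the same idea, the search can be written as explicit loops that stop at the first match and return the results, a form that compiles readily under numba. This is a sketch, not the answerer's code: the function name first_index_loop, the -1 "not found" marker, and the fallback that runs as plain Python when numba is not installed are all assumptions. The algorithm is still quadratic in the worst case, so it may remain too slow for 50M elements.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption: fall back to an identity decorator so the sketch still runs without numba.
    def njit(func):
        return func

@njit
def first_index_loop(npa1, npa2):
    """For each value of npa1, index of its first occurrence in npa2 (-1 if absent)."""
    out = np.empty(npa1.size, dtype=np.int64)
    for i in range(npa1.size):
        out[i] = -1  # default when the value is not found
        for j in range(npa2.size):
            if npa2[j] == npa1[i]:
                out[i] = j
                break  # stop at the first occurrence
    return out

# Arrays from the sample run in the thread, plus one missing value (5).
npa2 = np.array([8, 33, 12, 19, 77, 30, 81, 69, 20, 0])
npa1 = np.array([77, 19, 0, 69, 5])
res = first_index_loop(npa1, npa2)
print(res)
```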