
Question

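Imagine we have a list of unique integers. Given an integer (N) from that list, I want to get its index (I) in the array as quickly as possible.

My idea was to build an object that returns I when given N. I thought of using either a structured array with dtype (N,I) sorted by N, or simply a dictionary keyed by N.

The lookup speed of both methods seems to be independent of the object's size, which leads me to believe that both are dominated by overhead. However, I was a bit surprised to find that looking up a key in the dictionary is almost 10x faster than searching the structured array. So my questions are:

1. Why is the dictionary so much faster than my array implementation?
2. Is there an alternative that is even faster than these two methods?

MWE: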

from __future__ import division
import numpy as np
import timeit

#Time a function
def Timeme(funct,var,NN=10,NNN=10):
    for i in range(NN):
        start = timeit.default_timer()
        for t in range(NNN):
            funct(*var)
        end = timeit.default_timer()
        # Average time per call, in milliseconds
        print(str(i) + ': ' + str((end - start) / NNN * 1000))

#Function to build a dictionary        
def mydict(Flist):
    Mydict=dict()
    for n,i in Flist:
        Mydict[n]=i
    return Mydict

#Functions to access the data
def myfd(Mydict,vtest):
    return Mydict[vtest]

def myfs(Flist,vtest):
    n=Flist['N'].searchsorted(vtest)
    return Flist['I'][n] #Flist[n]['I'] is slower

#N=100000  
N=100

# "Allocate empty structured array"
Flist=np.empty(N,dtype=[('N','i4'),('I','i4')])

# "Fill N with randoms and I with sequence"
Flist['N'] = np.random.randint(N*1000,size=N)
Flist['I'] = np.arange(N)

# "Create test value"
ntest=np.random.randint(N)
vtest=Flist['N'][ntest]

# "Sort array on N"
Flist.sort(order='N')

# "Make dictionary"
Mydict=dict(Flist)

# "Get values"    
nrd=myfd(Mydict,vtest)
nrs=myfs(Flist,vtest)

print("Tests OK: " + str(ntest == nrd and ntest == nrs))

print("\nSearch with Dictionary:")
Timeme(myfd,[Mydict,vtest],NN=5,NNN=100)
print("\nSearch directly in Array:")
Timeme(myfs,[Flist,vtest],NN=5,NNN=100)

Results (average time per call, in milliseconds):

Tests OK: True

Search with Dictionary:
0: 0.000404204885682
1: 0.000409016848607
2: 0.000418640774457
3: 0.000404204885682
4: 0.000394580959833

Search directly in Array:
0: 0.00455211692685
1: 0.00465798011119
2: 0.00458580066732
3: 0.00464354422242
4: 0.00476384329554

Answer 1

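This can be partly explained by method-call/function-call overhead. Your dictionary lookup function performs a single indexing operation, which is translated into one call to my_dict.__getitem__(key). Your array-based implementation ends up making several calls: .searchsorted once, plus __getitem__ for each of Flist['N'], Flist['I'], and the final [n] index. Python is a dynamic language, and function calls, especially method calls (because of method resolution), are expensive.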
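To see where the time goes, each step of the array lookup can be timed on its own. The following is only a rough sketch: `flist`, `mydict`, `col_n`, `col_i` and `time_us` are illustrative names, the data is a simplified stand-in for the question's arrays (I is just the position in sorted order), and every measurement includes the small fixed cost of calling a lambda.

```python
import timeit
import numpy as np

# Hypothetical stand-ins for the question's Flist and Mydict: 100 unique,
# sorted keys, with I simply set to each key's position in the sorted order.
rng = np.random.default_rng(0)
flist = np.empty(100, dtype=[('N', 'i4'), ('I', 'i4')])
flist['N'] = np.sort(rng.choice(100_000, size=100, replace=False))
flist['I'] = np.arange(100)
mydict = {int(n): int(i) for n, i in zip(flist['N'], flist['I'])}
vtest = int(flist['N'][42])

# Precompute the intermediate results so each step can be timed in isolation.
col_n = flist['N']
col_i = flist['I']
pos = col_n.searchsorted(vtest)

def time_us(func, number=20_000):
    """Return the mean time per call of func, in microseconds."""
    return timeit.timeit(func, number=number) / number * 1e6

steps = [
    ("dict lookup: mydict[vtest]", lambda: mydict[vtest]),
    ("field access: flist['N']", lambda: flist['N']),
    ("binary search: col_n.searchsorted(vtest)", lambda: col_n.searchsorted(vtest)),
    ("field access: flist['I']", lambda: flist['I']),
    ("element access: col_i[pos]", lambda: col_i[pos]),
]
for label, func in steps:
    print(f"{label:45s} {time_us(func):8.3f} us")
```

The absolute numbers depend on the machine and the NumPy version, so none are quoted here. The point is only that the array path is the sum of several separately dispatched calls, while the dictionary path is one.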

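More fundamentally, though, the dict-based implementation should scale better. Python dict objects are highly optimized hash maps, typically with constant-time lookup. The array-based implementation is a binary search, so it is O(log(n)). You would see this in a test case that uses the worst-case scenario, i.e. searching for an element that is not in the array. Since searchsorted scales logarithmically, however, you may need to increase the size of the array substantially (e.g. 100x, 1000x) before seeing a significant effect on the runtime.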
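The scaling argument can be checked with a small experiment along these lines. This is a sketch under stated assumptions, not the question's benchmark: `build`, `lookup_sorted` and `time_us` are hypothetical helpers, the keys are shuffled even numbers so that any odd number is guaranteed to be missing, and the array lookup adds an explicit presence check because `searchsorted` only returns an insertion position.

```python
import timeit
import numpy as np

def build(n, seed=0):
    """Return (sorted_keys, order, table) for n unique integer keys.

    Assumption for this sketch: keys are the even numbers 0, 2, ..., 2n-2
    in shuffled order, so every odd number is a guaranteed miss.
    """
    rng = np.random.default_rng(seed)
    keys = rng.permutation(n) * 2
    order = np.argsort(keys)           # order[pos] is the original index
    sorted_keys = keys[order]
    table = dict(zip(keys.tolist(), range(n)))
    return sorted_keys, order, table

def lookup_sorted(sorted_keys, order, key):
    """Binary-search lookup; returns -1 when key is absent."""
    pos = sorted_keys.searchsorted(key)
    if pos < len(sorted_keys) and sorted_keys[pos] == key:
        return int(order[pos])
    return -1

def time_us(func, number=10_000):
    """Return the mean time per call of func, in microseconds."""
    return timeit.timeit(func, number=number) / number * 1e6

for n in (100, 10_000, 100_000):
    sorted_keys, order, table = build(n)
    missing = 2 * (n // 2) + 1         # odd, hence not among the keys
    t_dict = time_us(lambda: table.get(missing, -1))
    t_arr = time_us(lambda: lookup_sorted(sorted_keys, order, missing))
    print(f"n={n:>7}: dict {t_dict:.3f} us, sorted array {t_arr:.3f} us")
```

At these sizes the per-call overhead is likely to dominate both columns, which is consistent with the question's observation that lookup time looks independent of size.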

There is absolutely no way you will implement a lookup in Python that is faster than the built-in dict.

Searching a dictionary vs. searching a sorted NumPy structured array
