Fastest search in Python that returns the index in a list of tuples/lists

Date: 2014-10-01 17:55:12

Tags: python performance list python-3.3

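I have a list/tuple of tuples/lists (it doesn't matter which one I use), where the inner lists or tuples have variable length. I need to check whether a value is in the inner list or tuple in the first slot.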

  

The structure looks like this:

     

[[[int, int, int, ..., int], ...], ... repeated about 20x]

Example:

  

([1,21,54,55,93,99,284,393,964,1029,1214,1216,1223,1253,1258,1334,1365,1394,1397,1453,1471,1543,1589,1824,1975,2054,2090,2164,2165,2166,2163,2233,2547,2645,2802,2809,2931,2958,3031,3071,3077,3078,3189,3199,3202,3203],[1,1,1,1,1,1,1,1,2,1,2,1,1,2,1,1,1,2,4,2,1,1,1,1,1,1,1,1,2,2,4,2,1,1,1,1,1,2,3,1,2,1,3,3,1,2],[[3],[1],[4],[2],[12],[6],[3],[8],[20,27],[11],[4,7],[71],[133],[74,74],[6],[67],[34],[3,16],[9,7,23,71],[11,43],[67],[71],[4],[139],[16],[52],[4],[31],[7,50],[2,12],[1,1,81,114],[13,70],[60],[121],[30],[16],[214],[29,78],[9,37,60],[14],[216,249],[28],[2,2,21],[4,18,22],[59],[8,24]])

     

This is just the first of 20k+ similar elements in my list.

So I have a function that checks whether a number is in:

  

[1,21,54,55,93,99,284,393,964,1029,1214,1216,1223,1253,1258,1334,1365,1394,1397,1453,1471,1543,1589,1824,1975,2054,2090,2164,2165,2166,2163,2233,2547,2645,2802,2809,2931,2958,3031,3071,3077,3078,3189,3199,3202,3203]

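and returns its index.

My function: iD is the number I'm searching for, and posting is just the first element in my nested loop (the example tuple above is one such posting).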

def searchCurrentPosting(iD, posting):
    x = 0
    # Linear scan over the list at index 0 of the posting
    for each in posting[0]:
        if iD == each:
            return x
        x += 1
    return False

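I have to run this search function every time a new word is given (20k to the power of some given number). This code takes about a minute to run. Is there any way to cut the time down?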

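Edit: if you want my whole program, here it is:

This is my main driver: http://pastebin.com/Udjit7PP

The file it parses is the CACM collection, which is a standard for IR testing.

It uses a stemmer (the Porter stemmer): http://pastebin.com/AzA0fvdV

Yes, I am creating an inverted index.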

1 Answer:

Answer 0: (score: 4)

Since your list at index 0 is sorted, you can use the bisect module to find the index in O(log N) time:

In [33]: import bisect

In [34]: lst = [1, 21, 54, 55, 93, 99, 284, 393, 964, 1029, 1214, 1216, 1223, 1253, 1258, 1334, 1365, 1394, 1397, 1453, 1471, 1543, 1589, 1824, 1975, 2054, 2090, 2164, 2165, 2166, 2167, 2323, 2547, 2645, 2802, 2809, 2931, 2958, 3031, 3071, 3077, 3078, 3189, 3199, 3202, 3203]

In [35]: n = 2802

In [36]: ind = bisect.bisect_left(lst, n)

In [37]: if ind < len(lst) and lst[ind] == n:
    ...:     print('Item found at {}'.format(ind))
    ...:     
Item found at 34
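To plug this into the function from the question, a minimal sketch might look like the following. The name `search_posting` and the shortened sample posting are made up for illustration, and it assumes `posting[0]` is sorted in ascending order. It returns `None` rather than `False` when the ID is absent, because index 0 and `False` compare equal in Python.

```python
import bisect

def search_posting(doc_id, posting):
    """Return the index of doc_id in posting[0], or None if absent.

    Hypothetical helper; assumes posting[0] is sorted ascending.
    """
    ids = posting[0]
    i = bisect.bisect_left(ids, doc_id)
    # bisect_left can return len(ids), so check the bound before indexing
    if i < len(ids) and ids[i] == doc_id:
        return i
    return None

# Shortened, made-up posting in the same (ids, counts, positions) shape
posting = ([1, 21, 54, 55, 93], [1, 1, 1, 1, 1], [[3], [1], [4], [2], [12]])
print(search_posting(54, posting))    # found
print(search_posting(1000, posting))  # larger than every ID, so not found
```

Callers should then test the result with `is None` instead of relying on truthiness.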

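Note that if the list is not sorted, it is better to sort it once first and store a reference to it in a variable, so that you don't have to sort it again and again.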
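One caveat is that sorting changes positions. If the returned index has to refer to the original order (for example, to line up with the parallel lists at indexes 1 and 2), you could sort the positions once and bisect over the sorted values. A rough sketch with invented names and made-up data:

```python
import bisect

ids = [93, 1, 55, 21, 54]   # unsorted sample data (made up)

# Sort once, remembering where each value originally lived.
order = sorted(range(len(ids)), key=ids.__getitem__)
sorted_ids = [ids[i] for i in order]

def original_index(value):
    """Return the position of value in the unsorted list, or None."""
    k = bisect.bisect_left(sorted_ids, value)
    if k < len(sorted_ids) and sorted_ids[k] == value:
        return order[k]
    return None

print(original_index(55))   # position of 55 in the unsorted list
print(original_index(2))    # not present
```

Because `sorted` is stable and `bisect_left` picks the leftmost match, duplicates resolve to their first occurrence, as with list.index.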

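Another option is to use a dictionary with the items as keys and their indexes as values (for repeated items, store only the index of the first occurrence, i.e. similar to list.index). Once the dictionary has been built, you can get an item's index in O(1) time.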

In [38]: dct = {}

In [39]: for i, x in enumerate(lst):
    ...:     if x not in dct:
    ...:         dct[x] = i
    ...:         

In [40]: dct.get(n)
Out[40]: 34

In [41]: dct.get(1000)  # returns None for non-existent items
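For the 20k postings in the question, the same idea could be applied once per posting up front, so that each later lookup is a single dict access. This is only a sketch: `build_lookups` and the sample data are hypothetical, and it assumes the postings do not change after the dictionaries are built.

```python
def build_lookups(postings):
    """Build one {doc_id: index} dict per posting (first occurrence wins)."""
    lookups = []
    for posting in postings:
        table = {}
        for i, doc_id in enumerate(posting[0]):
            table.setdefault(doc_id, i)   # keep the first index only
        lookups.append(table)
    return lookups

# Made-up postings in the same (ids, counts, positions) shape
postings = [
    ([1, 21, 54], [1, 1, 2], [[3], [1], [4, 7]]),
    ([21, 99], [1, 1], [[2], [12]]),
]
lookups = build_lookups(postings)
print(lookups[0].get(54))   # index of 54 in the first posting
print(lookups[1].get(54))   # None: 54 is not in the second posting
```

The trade-off is extra memory for one dictionary per posting in exchange for constant-time lookups.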

Timing comparison:

In [43]: lst = list(range(10**5))

In [44]: %timeit bisect.bisect_left(lst, 10**5-5)
1000000 loops, best of 3: 444 ns per loop

In [45]: %timeit lst.index(10**5-5)
1000 loops, best of 3: 1.29 ms per loop

In [46]: %timeit dct.get(10**5-5) #dct created using the new list.
10000000 loops, best of 3: 104 ns per loop
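Outside IPython, roughly the same comparison can be reproduced with the standard timeit module. The numbers will differ by machine and Python version, so treat this as a sketch for measuring your own data rather than a benchmark result. The helper `best_time` is an invented name.

```python
import bisect
import timeit

lst = list(range(10**5))
dct = {x: i for i, x in enumerate(lst)}
target = 10**5 - 5

def best_time(func, number):
    """Return the best per-call time in seconds over 3 repeats."""
    return min(timeit.repeat(func, number=number, repeat=3)) / number

results = {
    "bisect_left": best_time(lambda: bisect.bisect_left(lst, target), 10000),
    "list.index": best_time(lambda: lst.index(target), 100),
    "dict.get": best_time(lambda: dct.get(target), 10000),
}
for name, seconds in results.items():
    print("{:12s} {:.3e} s per call".format(name, seconds))
```

The lambda wrappers add a small constant overhead to every measurement, which matters most for the fastest operations.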

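If you are going to keep updating the list at index 0 and it is not sorted, then you should simply use list.index() rather than a loop, a dictionary, or bisect.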

In [47]: try:
    ...:     ind = lst.index(n)
    ...:     print('Item found at {}'.format(ind))
    ...: except ValueError:
    ...:     pass
    ...: 
Item found at 34
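If you want this as a drop-in replacement for the loop in the question, one possible wrapper is sketched below. The name `find_index` and the sample data are made up, and returning `None` for a miss is a design choice of this sketch, not something the question's code does.

```python
def find_index(items, value):
    """Return the first index of value in items, or None if it is absent."""
    try:
        return items.index(value)
    except ValueError:      # list.index raises ValueError for a missing value
        return None

ids = [93, 1, 55, 21, 54]   # unsorted sample data (made up)
ids.append(7)               # the list keeps changing between lookups
print(find_index(ids, 7))
print(find_index(ids, 1000))
```

This is still a linear scan, but it runs in C inside list.index instead of in a Python-level loop.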