Question

I have two lists, for example:

import random

# Python 2 code; on Python 3 use range instead of xrange
aa=[int(1000*random.random()) for i in xrange(10000)]
bb=[int(1000*random.random()) for i in xrange(10000)]

I would like another list that tells me where in aa the items of list bb are; if an item does not exist, I want it to return -1.

The lists could be huge, and this has to run thousands of times, so any speed-up would help.

The fastest approach I have found so far is:

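def index_withoutexception(aa,bb):
    # Return the index of item bb in list aa, or -1 if it is absent.
    try:
        return aa.index(bb)
    except ValueError:  # list.index raises ValueError for a missing item
        return -1
ls = [index_withoutexception(bb,i) for i in aa]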

Is there a faster way to achieve this?

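N.B. The problem with using an if statement is that I can't find a lookup function that returns nan / -1; they all raise an exception, which I gather is the slow part.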

Answer 1

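The numpy_indexed package can be used to solve this in a fully vectorized manner (disclaimer: I am its author). Note that you could also replace the rest of your code with numpy; otherwise that code is bound to become the bottleneck.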

import numpy_indexed as npi
i = npi.indices(aa, bb, missing='mask').filled(-1)

Answer 2

Here is an approach based on np.searchsorted, inspired by another post:

sidx = np.argsort(bb)
L = np.searchsorted(bb,aa,sorter=sidx,side='left')
R = np.searchsorted(bb,aa,sorter=sidx,side='right')
out = np.where(L != R,sidx[L],-1)

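Note that if bb is already sorted, you can skip computing sidx and drop every other use of it, which improves performance further. The shortened code for that case is: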

L = np.searchsorted(bb,aa,side='left')
R = np.searchsorted(bb,aa,side='right')
out = np.where(L != R,L,-1)

Also note that the output is a NumPy array. If you absolutely need a list as output, you can call out.tolist().
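One edge case may be worth guarding against: when an element of aa is larger than every element of bb, np.searchsorted returns len(bb), and indexing sidx with that value raises IndexError. Below is a minimal sketch of a guarded variant, written for Python 3. The function name first_indices_searchsorted is hypothetical, and the stable sort (so that duplicates in bb resolve to their first position, like list.index) and the empty-bb shortcut are my own assumptions rather than part of the answer.

```python
import numpy as np

def first_indices_searchsorted(aa, bb):
    """For each item of aa, return its first index in bb, or -1 if absent."""
    aa = np.asarray(aa)
    bb = np.asarray(bb)
    if bb.size == 0:
        # Nothing to find: every lookup is a miss.
        return np.full(aa.shape, -1)
    # A stable sort keeps equal elements in their original order, so the
    # leftmost match in sorted order is the first occurrence in bb.
    sidx = np.argsort(bb, kind="stable")
    left = np.searchsorted(bb, aa, sorter=sidx, side="left")
    right = np.searchsorted(bb, aa, sorter=sidx, side="right")
    # left can equal len(bb) for items larger than everything in bb;
    # clamp it so the fancy indexing below stays in bounds.
    safe_left = np.minimum(left, bb.size - 1)
    # left != right means at least one element of bb equals the item.
    return np.where(left != right, sidx[safe_left], -1)
```

For example, first_indices_searchsorted([5, 1, 99, 3], [3, 5, 3, 1]).tolist() gives [1, 3, -1, 0]; the 99 is larger than everything in bb and comes back as -1 instead of raising.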

Runtime test

Let's time the vectorized approach against the original loop version.

1] Set up the inputs:

In [171]: import numpy as np
     ...: 
     ...: # Create random unique lists
     ...: 
     ...: # 1. Random elements
     ...: aa=[int(1000*np.random.random()) for i in xrange(10000)]
     ...: bb=[int(1000*np.random.random()) for i in xrange(10000)]
     ...: 
     ...: # 2. Unique elements
     ...: aa = np.unique(aa)
     ...: bb = np.unique(bb)
     ...: 
     ...: # 3. Since np.unique sorts the elements, let's randomize them
     ...: aa = aa[np.random.permutation(aa.size)]
     ...: bb = bb[np.random.permutation(bb.size)]
     ...: 
     ...: # 4. Finally, make lists from the arrays
     ...: aa = aa.tolist()
     ...: bb = bb.tolist()
     ...:

2] Define the loop and vectorized versions:

In [172]: def index_withoutexception(aa,bb):
     ...:     try:
     ...:         return aa.index(bb)
     ...:     except:
     ...:         return -1
     ...:     

In [173]: def vectorized_approach(aa,bb):
     ...:     sidx = np.argsort(bb)
     ...:     L = np.searchsorted(bb,aa,sorter=sidx,side='left')
     ...:     R = np.searchsorted(bb,aa,sorter=sidx,side='right')
     ...:     return np.where(L != R,sidx[L],-1)
     ...:

3] Finally, verify and time the results:

In [174]: out1 = [index_withoutexception(bb,i) for i in aa]

In [175]: out2 = vectorized_approach(aa,bb)

In [176]: np.allclose(out1,out2)
Out[176]: True

In [177]: %timeit [index_withoutexception(bb,i) for i in aa]
100 loops, best of 3: 11.6 ms per loop

In [178]: %timeit vectorized_approach(aa,bb)
1000 loops, best of 3: 780 µs per loop

Answer 3

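You can create a dict or a defaultdict(list) that maps each element to the index (or indices) at which it occurs. This needs extra space (more than the original list, but still in the same ballpark), but once the dict is built, each index lookup is O(1).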

>>> import collections, random
>>> lst = [random.randint(0, 100) for _ in range(100)]
>>> indices = collections.defaultdict(list)
>>> for i, e in enumerate(lst):
...     indices[e].append(i)
...
>>> indices[30]
[21, 28, 89]

Applied to your specific problem, you could try something like this:

>>> aa = [random.randint(0, 10) for _ in range(20)] # [3, 9, 4, 5, 6, 5, 2, 4, 7, 4, 4, 9, 10, 8, 8, 7, 6, 3, 3, 3]
>>> bb = [random.randint(0, 10) for _ in range(20)] # [10, 7, 4, 9, 8, 4, 10, 7, 9, 1, 4, 8, 8, 3, 8, 0, 1, 10, 1, 6]
>>> aa_indices = {e: i for (i, e) in reversed(list(enumerate(aa)))} # {2: 6, 3: 0, 4: 2, 5: 3, 6: 4, 7: 8, 8: 13, 9: 1, 10: 12}
>>> b_in_a = [aa_indices.get(b, -1) for b in bb]
>>> b_in_a
[12, 8, 2, 1, 13, 2, 12, 8, 1, -1, 2, 13, 13, 0, 13, -1, -1, 12, -1, 4]

Note: this uses reversed; otherwise the dictionary would hold the last index of each element instead of the first.
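If building a reversed copy of the list feels wasteful, the first-index map can also be built in a single forward pass. This is a rough sketch of that alternative, not the code the answer timed; the names first_index_map and indices_in are hypothetical.

```python
def first_index_map(values):
    """Map each distinct value to the index of its first occurrence."""
    first = {}
    for i, v in enumerate(values):
        # setdefault only stores i when v has not been seen yet,
        # so later duplicates never overwrite the first index.
        first.setdefault(v, i)
    return first

def indices_in(haystack, needles):
    """For each needle, return its first index in haystack, or -1."""
    lookup = first_index_map(haystack)
    return [lookup.get(n, -1) for n in needles]
```

For example, indices_in([3, 5, 3, 1], [5, 1, 99, 3]) returns [1, 3, -1, 0]. If the same haystack is searched many times, building the map once with first_index_map and reusing it avoids paying the construction cost on every run.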

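Some timing analysis using IPython's %timeit: this approach needs just 2.24 ms to create the dict and 2.88 ms to build the final list, compared with 173 ms for the original approach.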

>>> %timeit [index_withoutexception(bb,i) for i in aa]
10 loops, best of 3: 173 ms per loop
>>> %timeit bb_indices = {e: i for (i, e) in reversed(list(enumerate(bb)))}
100 loops, best of 3: 2.24 ms per loop
>>> %timeit [bb_indices.get(i, -1) for i in aa]
100 loops, best of 3: 2.88 ms per loop

Python: find the index of items in list b from list a
