Question

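I'm having a hard time finding an efficient way to look up indices in a Python list. Every solution I have tested so far is slower than the find function in MATLAB. I have only just started using Python (so I'm not very experienced).

In MATLAB I would use the following: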

a = linspace(0, 1000, 1000); % monotonically increasing vector
b = 1000 * rand(1, 100); % 100 points I want to find in a
for i = 1 : numel(b)
    indices(i) = find(b(i) <= a, 1); % find the first index where b(i) <= a
end

If I use MATLAB's arrayfun() I can speed this up. In Python I tried several possibilities. I used

import numpy
indices = []
for i in xrange(0, len(b)):
    tmp = numpy.where(b[i] <= a)  # all positions in a where b[i] <= a
    indices.append(tmp[0][0])     # keep only the first one

which takes a lot of time, especially if a is very large. If b is sorted, then I can use

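curr_idx = 0
indices = []
for i in xrange(0, len(a)):  # walk through a once; b must be sorted
    if b[curr_idx] <= a[i]:
        indices.append(i)
        curr_idx += 1        # at most one element of b is consumed per step
    if curr_idx >= len(b):
        break                # every element of b has been placed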

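This is much faster than the numpy.where() solution, because I only have to search through the list once, but it is still slower than the MATLAB solution.

Can anyone suggest a better / more efficient solution? Thanks in advance.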

Answer 1

Try numpy.searchsorted:

>>> import numpy as np
>>> a = np.array([0, 1, 2, 3, 4, 5, 6, 7])
>>> b = np.array([1, 2, 4, 3, 1, 0, 2, 9])
>>> # sorting b "into" a
>>> np.searchsorted(a, b, side='right')-1
array([1, 2, 4, 3, 1, 0, 2, 7])

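You might have to do some special handling for values in b that are out of range, such as the 9 in the example above. Even so, this should be faster than any loop-based method.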
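
The question asks for the first index where b(i) <= a. Assuming that is the result wanted, a minimal sketch is to call np.searchsorted with side='left', which returns that index directly (0-based, so MATLAB's result minus one). The names first_ge and out_of_range are made up for this illustration.

```python
import numpy as np

a = np.array([0, 1, 2, 3, 4, 5, 6, 7])
b = np.array([1, 2, 4, 3, 1, 0, 2, 9])

# side='left' returns, for each value v in b, the first index i with v <= a[i].
first_ge = np.searchsorted(a, b, side='left')

# Values larger than a[-1] come back as len(a), which is not a valid index
# into a, so flag them explicitly instead of using them blindly.
out_of_range = first_ge == len(a)

print(first_ge)
print(out_of_range)
```

Whether an out-of-range value should be dropped, clipped, or reported as an error depends on the application, so the sketch only flags it.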

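As an aside: likewise, histc in MATLAB will be much faster than the loop.

Edit

If you want the index where b is closest to a, you should be able to use the same code, just with a modified a: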

>>> a_mod = 0.5*(a[:-1] + a[1:])  # take the centers between the elements in a
>>> np.searchsorted(a_mod, np.array([0.9, 2.1, 4.2, 2.9, 1.1]), side='right')
array([1, 2, 4, 3, 1])

Note that you can drop the -1, since a_mod has one element fewer than a.
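
As a sanity check of the midpoint trick, the sketch below compares it with a brute-force nearest-element search on the same small example. The brute-force line is only there for verification and is not part of the suggested approach; the variable names are illustrative.

```python
import numpy as np

a = np.array([0, 1, 2, 3, 4, 5, 6, 7], dtype=float)
queries = np.array([0.9, 2.1, 4.2, 2.9, 1.1])

# Midpoints between neighbouring elements of a act as decision boundaries.
a_mod = 0.5 * (a[:-1] + a[1:])
nearest = np.searchsorted(a_mod, queries, side='right')

# Brute force: distance from every query to every element of a.
brute = np.abs(queries[:, None] - a[None, :]).argmin(axis=1)

print(nearest)
print(brute)
```

The two can disagree when a query lies exactly on a midpoint, because the tie is then broken by the side argument rather than by argmin.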

Answer 2

Here numpy is used only to generate the numbers (nothing is vectorized):

import numpy as np
a = np.linspace(0, 1000, 1000)
b = 1000 * np.random.rand(100)
indices = [next(i for i, ai in enumerate(a) if bi <= ai) for bi in b]

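This works when a.max() >= b.max(), as in the example; otherwise it raises StopIteration. It is still slow, although each lookup stops at the first match instead of evaluating the whole comparison the way b(i) <= a does.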
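
One way to avoid the StopIteration, sketched here on made-up data, is to pass a default value to next(). Using len(a) as that default is an assumption of this example; it simply marks "larger than every element of a".

```python
a = [0.0, 2.5, 5.0, 7.5, 10.0]
b = [3.0, 0.0, 10.0, 11.0]  # 11.0 is larger than every element of a

# next(iterator, default) returns the default instead of raising
# StopIteration when no element of a satisfies bi <= ai.
indices = [next((i for i, ai in enumerate(a) if bi <= ai), len(a)) for bi in b]

print(indices)
```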

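If you need the indices as an array rather than a list, call np.array(indices) afterwards. If you need some optimization, you can sort b and keep a single enumerate(a) going, resuming from where the previous element stopped instead of starting over.

You can also try it on PyPy, without numpy: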

def igen(a, b):
    iterb = iter(b)
    bi = next(iterb)
    for i, ai in enumerate(a):
        while bi <= ai:
            yield i
            bi = next(iterb)
    i += 1 # Last bi are bigger than all ai
    yield i
    for unused in iterb:
        yield i

from random import random
a = (i * 1000. / 999. for i in xrange(43032500))
b = sorted(random() * 1000 for unused in xrange(3848))
indices = list(igen(a, b))

This is a generator built on that idea, and b must be sorted. It yields len(a) for every bi that is greater than all ai.
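
The generator above relies on next(iterb) raising StopIteration to end the generator, which is how the Python versions benchmarked here behave. From Python 3.7 on (PEP 479), a StopIteration escaping a generator body becomes a RuntimeError. The following is a hedged adaptation for newer Python 3, not the version that was timed; igen_py3 and sentinel are names introduced for this sketch.

```python
def igen_py3(a, b):
    """For each value of sorted b, yield the first index i with bi <= a[i].

    Values of b larger than every element of a yield len(a).
    """
    sentinel = object()          # marks "b is exhausted"
    iterb = iter(b)
    bi = next(iterb, sentinel)
    if bi is sentinel:           # empty b: nothing to yield
        return
    consumed = 0                 # number of elements of a seen so far
    for i, ai in enumerate(a):
        while bi <= ai:
            yield i
            bi = next(iterb, sentinel)
            if bi is sentinel:   # all of b handled, stop early
                return
        consumed = i + 1
    # a is exhausted: the remaining bi are bigger than all ai
    while bi is not sentinel:
        yield consumed
        bi = next(iterb, sentinel)


print(list(igen_py3([0, 1, 2, 3], [0.5, 0.5, 2, 9, 10])))
```

Like the original, it accepts any iterable for a, including a generator expression, because a is only traversed once.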

For testing, I'm using:

setup = """
from random import random

def igen(a, b):
    iterb = iter(b)
    bi = next(iterb)
    for i, ai in enumerate(a):
        while bi <= ai:
            yield i
            bi = next(iterb)
    i += 1 # Last bi are bigger than all ai
    yield i
    for unused in iterb:
        yield i
"""

program = """
a = (i * 1000. / 999. for i in xrange(43032500))
b = sorted(random() * 1000 for unused in xrange(3848))
indices = list(igen(a, b))
"""

# Python 2 and 3 compatibility
import sys
if sys.version_info.major == 3:
    program = program.replace("xrange", "range")

# Time it! =)
from timeit import timeit
print(timeit(program, setup, number=5000))

This means I ran the algorithm 5000 times in each environment. The time reported is the sum of the durations of all runs of program, not the average:

On CPython 3.4.0, the result was 11.491293527011294 (seconds)
On CPython 2.7.6, the result was 9.39319992065 (seconds)
On PyPy 2.2.1, the result was 3.31203603745 (seconds)
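
Since timeit returns the total for all runs, a per-run figure has to be derived by dividing by number. A small sketch with a stand-in statement (the sorted(data) workload and the repeat count of 1000 are placeholders, not the benchmark above):

```python
from timeit import timeit

number = 1000  # placeholder repeat count, smaller than the 5000 used above

# timeit returns the total time, in seconds, for `number` executions.
total = timeit("sorted(data)", setup="data = list(range(100, 0, -1))", number=number)
per_run = total / number

print("total: %.6f s, per run: %.9f s" % (total, per_run))
```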

More specific version information:

Python 3.4.0 (default, Apr 11 2014, 13:05:11) [GCC 4.8.2] on linux
Python 2.7.6 (default, Mar 22 2014, 22:59:56) [GCC 4.8.2] on linux2
Python 2.7.3 (2.2.1+dfsg-1, Nov 28 2013, 05:13:10) [PyPy 2.2.1 with GCC 4.8.2] on linux2

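Now the same test with an adapted "two ifs" version (code below) gives these results:

On CPython 3.4.0, the result was 13.03860338096274 (seconds)
On CPython 2.7.6, the result was 10.7371659279 (seconds)
On PyPy 2.2.1, the result was 2.88891601562 (seconds)

PyPy found a way to optimize your version, but there is still a difference: I tested this one computing "a" only once, whereas my version computes "a" 5000 times. The code I ran was: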

setup = """
from random import random
a = [i * 1000. / 999. for i in xrange(43032500)]
"""

program = """
b = sorted(random() * 1000 for unused in xrange(3848))
curr_idx = 0
indices = []
for i in xrange(len(a)): # Why not for i, ai in enumerate(a)?
    if b[curr_idx] <= a[i]:
        indices.append(i)
        curr_idx += 1
    if curr_idx >= len(b):
        break
"""

# Python 2 and 3 compatibility
import sys
if sys.version_info.major == 3:
    setup = setup.replace("xrange", "range")
    program = program.replace("xrange", "range")

# Time it! =)
from timeit import timeit
print(timeit(program, setup, number=5000))

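Another version, which simply moves the assignment of a into program instead of keeping it in setup, takes the PyPy time to 2102.06863689 seconds (yes, more than 35 minutes). Storing things in a huge list is really slow. Changing the program to: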

a = (i * 1000. / 999. for i in xrange(43032500)) # A generator expression
[...]
for i, ai in enumerate(a):
    if b[curr_idx] <= ai:
    [...]

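brings us back to 3.11599397659 seconds with PyPy. In this version a is created 5000 times but never stored in a list. On the other hand, the igen version "hard-coded" outside of a function runs in 3.17516112328 seconds, where setup just imports random and program is: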

a = (i * 1000. / 999. for i in xrange(43032500))
b = sorted(random() * 1000 for unused in xrange(3848))
indices = []
iterb = iter(b)
try:
    bi = next(iterb)
    for i, ai in enumerate(a):
        while bi <= ai:
            indices.append(i)
            bi = next(iterb)
except StopIteration:
    pass
else:
    i += 1 # Last bi are bigger than all ai
    indices.append(i)
    for unused in iterb:
        indices.append(i)
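
To illustrate the difference between storing a in a list and producing it lazily, here is a small sketch that compares object sizes only. It does not reproduce the timings above, and memory footprint is just one plausible factor behind them. The size n is a placeholder, far smaller than the benchmark's.

```python
import sys

n = 100000  # placeholder size, far smaller than the 43032500 used above

a_list = [i * 1000. / 999. for i in range(n)]  # every value is stored
a_gen = (i * 1000. / 999. for i in range(n))   # values are produced on demand

list_bytes = sys.getsizeof(a_list)  # the list object itself (its pointer array)
gen_bytes = sys.getsizeof(a_gen)    # small and independent of n

print(list_bytes, gen_bytes)

# The generator can still be searched, but only once and only forwards.
first_at_least_500 = next(i for i, ai in enumerate(a_gen) if ai >= 500.)
print(first_at_least_500)
```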

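Anyhow, let A = len(a) and B = len(b): these are O[A + B.log(B)] algorithms (including @sebastian's solution with np.searchsorted). On the other hand, evaluating bi <= ai for all (bi, ai) pairs is O[B * A], so the MATLAB solution should be asymptotically slower unless it does some internal optimization to avoid the full comparison, making each statement fully lazy (but I have no MATLAB to verify that =/). For the sake of comparison, I ran this on GNU Octave:

start = time;
a = linspace(0, 1000, 43032500);
b = 1000 * rand(1, 3848);
for i = 1 : numel(b)
    indices(i) = find(b(i) <= a, 1);
end
stop = time;
stop - start

That is the original code from this question, doing once the process that Python ran 5000 times, and it took 203.16 seconds (more than 3 minutes).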
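
The gap between a full comparison and a lazy one can be made concrete by counting comparisons. This is a pure-Python sketch with made-up helper names (first_index_full, first_index_lazy); it says nothing about what MATLAB or Octave actually do internally.

```python
def first_index_full(a, bi):
    """Compare bi with every element of a, then pick the first hit."""
    mask = [bi <= ai for ai in a]          # always len(a) comparisons
    return mask.index(True), len(mask)     # raises ValueError if there is no hit


def first_index_lazy(a, bi):
    """Stop at the first element of a that satisfies bi <= ai."""
    comparisons = 0
    for i, ai in enumerate(a):
        comparisons += 1
        if bi <= ai:
            return i, comparisons
    return len(a), comparisons             # bi is larger than every element


a = [float(i) for i in range(1000)]
print(first_index_full(a, 10.0))  # (index, number of comparisons)
print(first_index_lazy(a, 10.0))
```

Both return the same index; only the amount of work differs, and the difference grows with len(a).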

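Oh, but you're cheating! Put that "start = time;" after the assignment to "a"!

OK, nobody said that, but I just tried such a change. Since each b(i) <= a is a vector of size 43032500, it doesn't change much: 202.83 seconds.

And Numpy?!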

Numpy has to store the data as well. Most of the time it doesn't work with generators (hstack and vstack are exceptions). But we can't tell which is faster without empirical evidence. Let's run it with Numpy 1.8.1:

setup = """
import numpy as np
a = np.linspace(0., 1000., 43032500) # Don't count this time
"""

program = """
b = 1000 * np.random.rand(3848)
indices = np.searchsorted(a, b, side='right') - 1 # From @sebastian solution
indices[b > a[-1]] = len(a) # Big value correction (my improvement)
"""

# Time it! =)
from timeit import timeit
print(timeit(program, setup, number=5000))

On CPython 2.7, 9.81494688988 seconds
On CPython 3.4, [...] seconds
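
To see what the "big value correction" does, here is the same pair of lines applied to a small made-up input in which one value of b lies above a[-1]. The data is illustrative only.

```python
import numpy as np

a = np.linspace(0., 10., 11)           # 0, 1, ..., 10
b = np.array([0.0, 2.5, 10.0, 12.0])   # 12.0 is above a[-1]

indices = np.searchsorted(a, b, side='right') - 1
before = indices.copy()                # 12.0 still maps to the last valid index

# Values above a[-1] are remapped to len(a) so they can be told apart
# from a genuine match on the last element.
indices[b > a[-1]] = len(a)

print(before)
print(indices)
```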

That's it. =)

Finding indices efficiently in a Python list (compared to MATLAB)
