Speeding up finding the next index in another list of indices (numpy)

Asked: 2019-06-05 18:38:02

Tags: python-3.x numpy cython numba

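The code below works as intended, but the loop makes it look unoptimized. I have managed to vectorize all of my other methods, but I can't figure out how to remove the loop from this one.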

Speed-wise, this becomes a problem when I have millions of rows.

Is there a vectorized approach, or should I try cython or numba? I have been trying to limit the number of packages I use.

Example code:

id

3 answers:

Answer 0 (score: 3)

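For each element x of leading, the maximum of (within - x) over all elements of within is the same as the maximum of within minus x, because subtracting a constant does not change which element is largest. So simply do: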

within.max() - leading

No extra modules are needed.
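The identity behind this answer can be checked on a small example. The sketch below is not from the original post: the array values are made up for illustration, and it assumes both inputs are one-dimensional numeric arrays.

```python
import numpy as np

# Small illustrative inputs (values chosen arbitrarily, not from the question).
within = np.array([4.0, 9.0, 1.0, 7.0])
leading = np.array([2.0, 5.0, 8.0])

# Loop version: for each x in leading, take the max of (within - x).
looped = np.array([(within - x).max() for x in leading])

# Vectorized version: subtracting a constant does not change which
# element is largest, so max(within - x) == within.max() - x.
vectorized = within.max() - leading

print(looped)      # [7. 4. 1.]
print(vectorized)  # [7. 4. 1.]
```

Both forms agree here, but the second scans within only once instead of once per element of leading.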

Timings:

In [79]: np.random.seed(0)
    ...: within = np.random.rand(1000000)
    ...: leading = np.random.rand(400000)

In [80]: %timeit within.max() - leading
1000 loops, best of 3: 850 µs per loop

Answer 1 (score: 1)

With numba, you can do a fairly straightforward translation of your code:

import numba as nb
import numpy as np

def find_leading(leading, within):
    # find first following elements in within array
    first_after_leading = []
    for _ in leading:
        temp = (within - _).max()

        first_after_leading.append(temp)

    # convert to np array
    first_after_leading = np.array(first_after_leading)

    return first_after_leading


@nb.jit(nopython=True)
def find_leading_nb(leading, within):
    # find first following elements in within array
    first_after_leading = np.empty_like(leading)
    for i, _ in enumerate(leading):
        temp = (within - _).max()

        first_after_leading[i] = temp

    return first_after_leading

Then, with the original inputs:

%timeit find_leading(leading, within)
%timeit find_leading_nb(leading, within)
%timeit (within[:,None] - leading).max(0)
17.3 µs ± 169 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
1.7 µs ± 25.3 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
6.48 µs ± 180 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

And then with some larger arrays:

leading = np.random.randint(0, 100, (1000,))
within = np.random.randint(0, 100, (100000,))

%timeit find_leading(leading, within)
%timeit find_leading_nb(leading, within)
%timeit (within[:,None] - leading).max(0)
145 ms ± 3.82 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
67.4 ms ± 218 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
553 ms ± 4.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Timings were run on macOS with Python 3.7, numba 0.44 and numpy 1.16.4.
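One likely reason the broadcasting version falls behind on the larger inputs is the temporary it builds: within[:, None] - leading materializes a full len(within) x len(leading) array before max(0) reduces it. A rough sketch of that estimate follows. The helper name broadcast_temp_bytes is hypothetical, and the dtype is set explicitly to int64 here because the default integer type of np.random.randint is platform dependent.

```python
import numpy as np

def broadcast_temp_bytes(leading, within):
    """Estimate the bytes needed for the temporary within[:, None] - leading."""
    result_dtype = np.result_type(within, leading)
    return within.size * leading.size * result_dtype.itemsize

# Same shapes as the larger benchmark, with an explicit dtype so the
# estimate does not depend on the platform default.
leading = np.zeros(1000, dtype=np.int64)
within = np.zeros(100000, dtype=np.int64)

print(broadcast_temp_bytes(leading, within) / 1e6, "MB")  # 800.0 MB
```

The loop-based versions only ever hold one temporary the size of within, which is a plausible explanation for the ordering seen in the timings above.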

Edit

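However, if I understand your algorithm correctly, a faster approach is to find the maximum of within only once and then take its difference with leading, so that you don't have to find the max of a temporary array inside the loop: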

@nb.jit(nopython=True)
def find_leading_nb2(leading, within):
    max_within = within.max()
    first_after_leading = np.empty_like(leading)
    for i, x in enumerate(leading):
        first_after_leading[i] = max_within - x
    return first_after_leading

With your original inputs this gives:

%timeit find_leading_nb2(leading, within)
919 ns ± 8.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

And with the larger inputs:

%timeit find_leading_nb2(leading, within)
21.6 µs ± 180 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Answer 2 (score: 0)

I think turning it into a one-liner would help. Give it a try.

first_after_leading = np.array([(within - _).max() for _ in leading])
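Note that a list comprehension still runs one Python-level iteration per element of leading, so it may not differ much from the explicit loop. Outside IPython, where %timeit is unavailable, a comparison could be sketched with the standard timeit module as below. The function names and array sizes are placeholders chosen for illustration, and no timings are shown because they depend on the machine.

```python
import timeit
import numpy as np

np.random.seed(0)
within = np.random.rand(10000)
leading = np.random.rand(200)

def with_comprehension(leading, within):
    # One-liner from this answer: still loops in Python over leading.
    return np.array([(within - x).max() for x in leading])

def with_single_max(leading, within):
    # Computes the maximum of within once, then subtracts elementwise.
    return within.max() - leading

runs = 20
for fn in (with_comprehension, with_single_max):
    seconds = timeit.timeit(lambda: fn(leading, within), number=runs)
    print(f"{fn.__name__}: {seconds / runs * 1e3:.3f} ms per call")
```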