Question

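I was profiling some code and got a result that surprised me about np.where(). I wanted to use where() on a slice of my array (knowing that a large part of the 2D array is irrelevant to my search) and found that it was the bottleneck in my code. As a test, I created a new 2D array as a copy of that slice and timed where() on it. It turned out to run noticeably faster. In my real case the speedup is very significant, but I think this test code still demonstrates what I found: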

import numpy as np

def where_on_view(arr):
    new_arr = np.where(arr[:, 25:75] == 5, arr[:, 25:75], np.nan)

def where_on_copy(arr):
    copied_arr = arr[:, 25:75].copy()
    new_arr = np.where(copied_arr == 5, copied_arr, np.nan)

arr = np.random.choice(np.arange(10), 1000000).reshape(1000, 1000)

The timeit results are:

%timeit where_on_view(arr)
398 µs ± 2.82 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit where_on_copy(arr)
295 µs ± 6.07 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

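Since both approaches return a new array, it is not clear to me how taking a full copy of the slice beforehand can speed up np.where() to this extent. I also ran a couple of sanity checks to confirm that:

In this case, both approaches return the same result.
The where() search really is restricted to the slice, rather than checking the whole array and then filtering the output.

Here: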

# Sanity check that they do give the same output

test_arr = np.random.choice(np.arange(3), 25).reshape(5, 5)
test_arr_copy = test_arr[:, 1:3].copy()

print("No copy")
print(np.where(test_arr[:, 1:3] == 2, test_arr[:, 1:3], np.nan))
print("With copy")
print(np.where(test_arr_copy == 2, test_arr_copy, np.nan))

# Sanity check that it doesn't search the whole array

def where_on_full_array(arr):
    new_arr = np.where(arr == 5, arr, np.nan)

#%timeit where_on_full_array(arr)
#7.54 ms ± 47.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

I am curious where the added overhead in this case comes from.

Answer 1

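Here are some source code snippets that explain the observation at least in part. I am not looking at where, because the difference seems to be created before it is called. Instead, I am looking at ufuncs in general.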
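One way to check that the difference arises before where is to time the == comparison (a ufunc) on its own. The following is a minimal sketch, not part of the original measurements: the array is rebuilt with np.random.default_rng, and the names view, copied, and n are chosen here for illustration. Absolute timings depend on the machine and the NumPy version, so none are quoted.

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)
arr = rng.integers(0, 10, size=(1000, 1000))

view = arr[:, 25:75]    # non-contiguous view into the original array
copied = view.copy()    # C-contiguous copy of the same data

# The two masks must be identical; only the memory layout of the input differs.
mask_view = view == 5
mask_copy = copied == 5
assert np.array_equal(mask_view, mask_copy)

n = 200  # repetitions per measurement (arbitrary choice)
t_view = timeit.timeit(lambda: view == 5, number=n) / n
t_copy = timeit.timeit(lambda: copied == 5, number=n) / n
t_copying = timeit.timeit(lambda: view.copy(), number=n) / n

print(f"== on view:         {t_view * 1e6:8.1f} us")
print(f"== on copy:         {t_copy * 1e6:8.1f} us")
print(f"view.copy() itself: {t_copying * 1e6:8.1f} us")
```

If the comparison on the view is slower than the copy plus the comparison on the copy, the extra cost sits in the strided ufunc evaluation rather than in where itself.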

How ufuncs basically work

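Ignoring some special casing for the moment, a ufunc is computed by a (possibly optimized) innermost 1D loop that runs inside an outer loop covering the other dimensions.

The outer loop is comparatively expensive: it uses the numpy nditer, so it has to set that up and, for each iteration, call iternext, which is a function pointer and therefore cannot be inlined.

The inner loop, by comparison, is a simple C loop.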
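The split between outer and inner loop can be made visible from Python with np.nditer and its external_loop flag, which hands out the 1D chunks an inner loop would process. This is only an illustration under the assumption that the Python-level iterator coalesces axes the same way as the C iterator used by ufuncs; the helper inner_loop_chunks is a name invented for this sketch.

```python
import numpy as np

arr = np.arange(1_000_000).reshape(1000, 1000)
view = arr[:, 25:75]    # strided view: 1000 rows of 50 elements each
copied = view.copy()    # contiguous copy with the same shape


def inner_loop_chunks(a):
    """Return the sizes of the 1D chunks nditer yields for array a."""
    return [chunk.size for chunk in np.nditer(a, flags=["external_loop"])]


view_chunks = inner_loop_chunks(view)
copy_chunks = inner_loop_chunks(copied)

# Each chunk corresponds to one step of the outer loop.
print("view:", len(view_chunks), "chunks, first size", view_chunks[0])
print("copy:", len(copy_chunks), "chunks, first size", copy_chunks[0])
```

For the view the iterator should report one chunk per row, while the contiguous copy collapses into a single chunk, so the outer loop runs only once.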

Strided ufunc evaluation incurs considerable overhead

From numpy/core/src/private/lowlevel_strided_loops.h, which is included in numpy/core/src/umath/ufunc_object.c:

/*
 *            TRIVIAL ITERATION
 *
 * In some cases when the iteration order isn't important, iteration over
 * arrays is trivial.  This is the case when:
 *   * The array has 0 or 1 dimensions.
 *   * The array is C or Fortran contiguous.
 * Use of an iterator can be skipped when this occurs.  These macros assist
 * in detecting and taking advantage of the situation.  Note that it may
 * be worthwhile to further check if the stride is a contiguous stride
 * and take advantage of that.

So we see that a ufunc with contiguous arguments can be evaluated with a single call to the fast inner loop, bypassing the outer loop entirely.
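Whether an array qualifies for this shortcut can be inspected through its flags. A short sketch, using an array built like the one in the question (the name rows is added here only as a contrast case):

```python
import numpy as np

arr = np.random.default_rng(0).integers(0, 10, size=(1000, 1000))

view = arr[:, 25:75]    # column slice: rows are no longer adjacent in memory
copied = view.copy()    # fresh C-contiguous buffer
rows = arr[25:75, :]    # row slice: still one contiguous block

for name, a in [("view", view), ("copied", copied), ("rows", rows)]:
    print(name, a.flags["C_CONTIGUOUS"], a.flags["F_CONTIGUOUS"], a.strides)
```

The column slice is neither C nor Fortran contiguous, so it does not meet the condition quoted in the comment above, whereas the copy does. A slice along the first axis stays contiguous and would not need a copy.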

To appreciate the difference in complexity and overhead, compare the functions trivial_two/three_operand_loop with iterator_loop in numpy/core/src/umath/ufunc_object.c, and look at all the npyiter_iternext_* functions in numpy/core/src/multiarray/nditer_templ.c.

Strided ufunc eval is more expensive than strided copy

From the auto-generated numpy/core/src/multiarray/lowlevel_strided_loops.c:

/*
 * This file contains low-level loops for copying and byte-swapping
 * strided data.
 *

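This file is almost 250,000 lines long.

By contrast, the file numpy/core/src/umath/loops.c, which is also auto-generated and provides the innermost ufunc loops, is only about 15,000 lines long.

This in itself suggests that copying may be more heavily optimized than ufunc evaluation.

Relevant here are the macros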

/* Start raw iteration */
#define NPY_RAW_ITER_START(idim, ndim, coord, shape) \
        memset((coord), 0, (ndim) * sizeof(coord[0])); \
        do {

[...]

/* Increment to the next n-dimensional coordinate for two raw arrays */
#define NPY_RAW_ITER_TWO_NEXT(idim, ndim, coord, shape, \
                              dataA, stridesA, dataB, stridesB) \
            for ((idim) = 1; (idim) < (ndim); ++(idim)) { \
                if (++(coord)[idim] == (shape)[idim]) { \
                    (coord)[idim] = 0; \
                    (dataA) -= ((shape)[idim] - 1) * (stridesA)[idim]; \
                    (dataB) -= ((shape)[idim] - 1) * (stridesB)[idim]; \
                } \
                else { \
                    (dataA) += (stridesA)[idim]; \
                    (dataB) += (stridesB)[idim]; \
                    break; \
                } \
            } \
        } while ((idim) < (ndim))

These macros are used by the function raw_array_assign_array in numpy/core/src/multiarray/array_assign_array.c, which does the actual copying for the Python ndarray.copy method.
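To make the control flow of the macros easier to follow, here is a pure-Python model of the same odometer-style iteration. It is a sketch and not NumPy's code: the function raw_iter_copy, the use of flat Python lists, and strides counted in elements rather than bytes are all assumptions made for illustration. As in the macros, dimension 0 is the innermost one.

```python
def raw_iter_copy(src, src_strides, src_offset, dst, dst_strides, dst_offset, shape):
    """Copy a strided block from src to dst, mimicking the raw iteration macros.

    shape and strides are listed innermost dimension first; strides are in elements.
    """
    ndim = len(shape)
    coord = [0] * ndim              # like memset(coord, 0, ...)
    a, b = src_offset, dst_offset   # play the role of dataA and dataB
    while True:
        # Inner 1D loop over dimension 0: a plain strided copy.
        for i in range(shape[0]):
            dst[b + i * dst_strides[0]] = src[a + i * src_strides[0]]
        # Advance to the next coordinate, as NPY_RAW_ITER_TWO_NEXT does.
        idim = 1
        while idim < ndim:
            coord[idim] += 1
            if coord[idim] == shape[idim]:
                # This dimension wrapped around: rewind it and carry over.
                coord[idim] = 0
                a -= (shape[idim] - 1) * src_strides[idim]
                b -= (shape[idim] - 1) * dst_strides[idim]
                idim += 1
            else:
                a += src_strides[idim]
                b += dst_strides[idim]
                break
        if idim >= ndim:            # every dimension wrapped: iteration is done
            break


# A 3x4 row-major array stored flat; copy columns 1:3 into a contiguous 3x2 buffer.
src = list(range(12))
dst = [None] * 6
raw_iter_copy(src, (1, 4), 1, dst, (1, 2), 0, shape=(2, 3))
print(dst)
```

The bookkeeping per outer step amounts to a few integer additions and comparisons written inline, with no iterator object to set up and no call through a function pointer.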

We can see that the overhead of "raw iteration" is fairly small compared with the "full iteration" used by ufuncs.

Why is np.where() faster on a copy of an array slice than on a view of the original array?
