Question

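I finally found the performance bottleneck in my code, but I'm puzzled about its cause. To fix it, I changed all my calls to numpy.zeros_like to use numpy.zeros instead. But why is zeros_like so much slower?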

For example (note the e-05 on the zeros call):

>>> import timeit
>>> timeit.timeit('np.zeros((12488, 7588, 3), np.uint8)', 'import numpy as np', number = 10)
5.2928924560546875e-05
>>> timeit.timeit('np.zeros_like(x)', 'import numpy as np; x = np.zeros((12488, 7588, 3), np.uint8)', number = 10)
1.4402990341186523

Strangely, though, writing to an array created with zeros is noticeably slower than writing to one created with zeros_like:

>>> timeit.timeit('x[100:-100, 100:-100] = 1', 'import numpy as np; x = np.zeros((12488, 7588, 3), np.uint8)', number = 10)
0.4310588836669922
>>> timeit.timeit('x[100:-100, 100:-100] = 1', 'import numpy as np; x = np.zeros_like(np.zeros((12488, 7588, 3), np.uint8))', number = 10)
0.33325695991516113

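My guess is that zeros uses some CPU trick and doesn't actually write to the memory when allocating it; the zeroing is done on the fly when the array is written to. But that still doesn't explain the massive discrepancy in array creation times.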

I'm running Mac OS X Yosemite with the current numpy version:

>>> numpy.__version__
'1.9.1'

Answer 1

My timings in IPython (with its simpler timeit interface):

In [57]: timeit np.zeros_like(x)
1 loops, best of 3: 420 ms per loop

In [58]: timeit np.zeros((12488, 7588, 3), np.uint8)
100000 loops, best of 3: 15.1 µs per loop

When I look at the code with IPython (np.zeros_like??), I see:

res = empty_like(a, dtype=dtype, order=order, subok=subok)
multiarray.copyto(res, 0, casting='unsafe')

whereas np.zeros is a black box: pure compiled code.
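To make the two-step structure concrete, here is a minimal sketch of what those two lines amount to, assuming the NumPy 1.9-era implementation quoted above. The name zeros_like_sketch is hypothetical, and the real function also forwards dtype, order, and subok.

```python
import numpy as np

def zeros_like_sketch(a):
    """Hypothetical stand-in for np.zeros_like: allocate, then fill."""
    # Step 1: allocate an uninitialized array with a's shape and dtype.
    res = np.empty_like(a)
    # Step 2: explicitly write 0 into every element. This touches all of
    # the memory up front, which is where the extra time goes.
    np.copyto(res, 0, casting='unsafe')
    return res

x = np.ones((4, 5, 3), np.uint8)
z = zeros_like_sketch(x)
```

The result matches np.zeros_like(x) in shape, dtype, and content; only the point at which the memory is actually written differs from np.zeros.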

Timings for empty are:

In [63]: timeit np.empty_like(x)
100000 loops, best of 3: 13.6 µs per loop

In [64]: timeit np.empty((12488, 7588, 3), np.uint8)
100000 loops, best of 3: 14.9 µs per loop

So the extra time in zeros_like is spent in that copyto.
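One way to check this split outside IPython is to time the allocation and the fill separately. This is a rough sketch, not a careful benchmark: the array is smaller than the one in the question, time_call is a hypothetical helper, and the numbers will vary with machine, OS, and NumPy version.

```python
import timeit
import numpy as np

# Smaller than the question's (12488, 7588, 3) array to keep the run short.
x = np.zeros((2000, 2000, 3), np.uint8)
res = np.empty_like(x)

def time_call(func, number=5):
    """Hypothetical helper: best average time per call, in seconds."""
    return min(timeit.repeat(func, number=number, repeat=3)) / number

t_empty_like = time_call(lambda: np.empty_like(x))
t_copyto = time_call(lambda: np.copyto(res, 0, casting='unsafe'))
t_zeros_like = time_call(lambda: np.zeros_like(x))

print("empty_like: %.3e s" % t_empty_like)
print("copyto:     %.3e s" % t_copyto)
print("zeros_like: %.3e s" % t_zeros_like)
```

If the explanation above holds, the copyto time should account for most of the zeros_like time, with empty_like contributing very little.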

In my tests, the difference in assignment time (x[]=1) is negligible.

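My guess is that zeros, ones, and empty were all early compiled creations. empty_like was added as a convenience, simply taking shape and type information from its input. zeros_like was written with more of an eye toward easy programming maintenance (reusing empty_like) than toward speed.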

np.ones and np.full also use the np.empty ... copyto sequence, and show similar timings.
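The same allocate-then-fill pattern can be sketched for these two functions. The names full_sketch and ones_sketch are hypothetical, and the real implementations take more parameters (order, and in later versions like and device), so treat this as an illustration of the pattern rather than the library source.

```python
import numpy as np

def full_sketch(shape, fill_value, dtype=None):
    """Hypothetical stand-in for np.full: np.empty followed by copyto."""
    if dtype is None:
        # Infer the dtype from the fill value, as np.full does by default.
        dtype = np.array(fill_value).dtype
    res = np.empty(shape, dtype=dtype)
    np.copyto(res, fill_value, casting='unsafe')
    return res

def ones_sketch(shape, dtype=float):
    """Hypothetical stand-in for np.ones: a full_sketch with value 1."""
    return full_sketch(shape, 1, dtype)

f = full_sketch((2, 3), 7)
o = ones_sketch((3,))
```

Because every element is written explicitly, both pay the cost of touching all the memory at creation time, just like zeros_like.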

https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/array_assign_scalar.c appears to be the file that copies a scalar (such as 0) to an array. I don't see memset being used.

https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/alloc.c has the calls to malloc and calloc.

https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/ctors.c is the source for zeros and empty. Both call PyArray_NewFromDescr_int, but one ends up using npy_alloc_cache_zero and the other npy_alloc_cache.

npy_alloc_cache in alloc.c calls alloc. npy_alloc_cache_zero calls npy_alloc_cache followed by a memset. The code in alloc.c is further complicated by a THREAD option.

More on the calloc vs. malloc+memset difference: Why malloc+memset is slower than calloc?

But with caching and garbage collection, I wonder whether the calloc/memset distinction applies.

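This simple test with the memory_profiler package supports the claim that zeros and empty allocate memory "on the fly", while zeros_like allocates everything up front: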

N = (1000, 1000) 
M = (slice(None, 500, None), slice(500, None, None))

Line #    Mem usage    Increment   Line Contents
================================================
     2   17.699 MiB    0.000 MiB   @profile
     3                             def test1(N, M):
     4   17.699 MiB    0.000 MiB       print(N, M)
     5   17.699 MiB    0.000 MiB       x = np.zeros(N)   # no memory jump
     6   17.699 MiB    0.000 MiB       y = np.empty(N)
     7   25.230 MiB    7.531 MiB       z = np.zeros_like(x) # initial jump
     8   29.098 MiB    3.867 MiB       x[M] = 1     # jump on usage
     9   32.965 MiB    3.867 MiB       y[M] = 1
    10   32.965 MiB    0.000 MiB       z[M] = 1
    11   32.965 MiB    0.000 MiB       return x,y,z

Answer 2

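Modern operating systems allocate memory virtually, i.e., memory is given to a process only when it is first used. zeros obtains memory from the operating system in such a way that the OS zeroes it when it is first used. zeros_like, on the other hand, fills the allocated memory with zeros on its own. Both ways require about the same amount of work; it's just that with zeros_like the zeroing is done up front, whereas zeros ends up doing it on the fly.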

Technically, in C the difference is calling calloc vs. malloc+memset.
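The up-front versus on-the-fly trade-off can be observed from Python by timing creation and the first full write separately. This is a rough sketch under assumptions: create_then_touch is a hypothetical helper, the array size is arbitrary, and the outcome depends on the OS, the allocator, and the NumPy version (small or cached allocations may not show the effect at all).

```python
import time
import numpy as np

def create_then_touch(make, shape=(4000, 4000), dtype=np.uint8):
    """Hypothetical helper: time array creation and the first full write."""
    t0 = time.perf_counter()
    a = make(shape, dtype)
    t1 = time.perf_counter()
    a[...] = 1  # the first write forces every page to be backed by real memory
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1, a

# zeros: expected to be cheap to create, with the cost deferred to first write.
zeros_create, zeros_touch, a = create_then_touch(lambda s, d: np.zeros(s, d))

# zeros_like: expected to pay for the zeroing at creation time.
# (The np.empty call is included in the timing, but it is cheap.)
like_create, like_touch, b = create_then_touch(
    lambda s, d: np.zeros_like(np.empty(s, d)))

print("zeros:      create %.3e s, first write %.3e s" % (zeros_create, zeros_touch))
print("zeros_like: create %.3e s, first write %.3e s" % (like_create, like_touch))
```

On a system that behaves as described, the sum of the two timings should be roughly comparable for both paths; only the distribution between creation and first write should differ.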

Why the performance difference between numpy.zeros and numpy.zeros_like?
