Question

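I have recently been trying to improve the performance (here, processing time) of code I wrote in Python 3.5, running on Ubuntu 16.04. My code performs a cosine Fourier transform, and I end up doing it many times, so it takes many, many hours...

My laptop is somewhat old, so I don't believe multithreading would help. In any case, I am more interested in how the computation itself is coded to make it faster. Here is the code I tried to improve.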

import numpy as np
import time
import math


#== Define my two large numpy arrays ==#
a = np.arange( 200000 )
b = np.arange( 200000 )


#===============#
#== First way ==#
#===============#

t1 = time.time()

#== Loop that performs 1D array calculation 50 times sequentially ==#
for i in range(0, 50):
    a * np.cos( 2 * math.pi * i * b )

t2 = time.time()
print( '\nLoop computation with 1D arrays: ', (t2-t1)*1000, ' ms' )


#================#
#== Second way ==#
#================#

t1 = time.time()

#== One liner to use 1D and 2D arrays at once ==#
a * np.cos( 2 * math.pi * ( np.arange( 50 ) )[:, None] * b )

t2 = time.time()
print( '\nOne liner using both 1D and 2D arrays at once: ', (t2-t1)*1000, ' ms\n' )

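In this situation, I need to perform a calculation 50 times using large Numpy arrays. I used to use a loop that performs the 1D array calculation sequentially, as many times as needed.

I recently tried to use Numpy's vectorization to do the calculation in a single line with a 2D array. It turns out the 2D array calculation takes more time, as the output shows: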

Loop computation with 1D arrays:  354.66670989990234  ms

One liner using both 1D and 2D arrays at once:  414.03937339782715  ms

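I did not expect that. Maybe, given the large arrays, memory overhead slows down the computation? Or is my laptop's CPU overwhelmed?

So my question is: what is the most efficient/fastest way to proceed in this situation?

Update: I tried Subhaneil Lahiri's Numba suggestion, adding the following lines of code and calling the function twice (still without storing any results):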

#===============#
#== Third way ==#
#===============#

import numba as nb

t1 = time.time()

@nb.jit(cache=True)
def cos_matrix(a, b, niter):
    for i in range(niter):
        a * np.cos(2 * math.pi * i * b)

cos_matrix( a, b , 50 )

t2 = time.time()
print( '\nLoop computation using Numba and 1D arrays: ', (t2-t1)*1000, ' ms' )

t1 = time.time()

cos_matrix( a, b , 50 )

t2 = time.time()
print( '\nSecond call to loop computation using Numba and 1D arrays: ', (t2-t1)*1000, ' ms\n' )

Unfortunately, it does not improve the results, as you can see:

Loop computation with 1D arrays:  366.67585372924805  ms

One liner using both 1D and 2D arrays at once:  417.5834655761719  ms

Loop computation using Numba and 1D arrays:  590.1947021484375  ms

Second call to loop computation using Numba and 1D arrays:  458.58097076416016  ms

Thanks a lot, Antoine.

Answer 1

First, think about your input and output data types. I assume that you want to do the calculation in double precision (float64), but single precision (float32) would be faster.
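As a rough illustration of the data type point, here is a minimal sketch that runs the same loop in float64 and float32 and compares dtype, memory use, and the resulting difference. The inputs in [0, 1) and the helper name `cos_rows` are assumptions made for this example, not the asker's data or code. Note that float32 loses accuracy quickly when the cosine argument is large, as it would be with the `np.arange(200000)` inputs from the question.

```python
import numpy as np

n = 20_000
niter = 50

# Hypothetical inputs in [0, 1); the question's real data is not shown.
a64 = np.linspace(0.0, 1.0, n, endpoint=False)
b64 = np.linspace(0.0, 1.0, n, endpoint=False)
a32 = a64.astype(np.float32)
b32 = b64.astype(np.float32)


def cos_rows(a, b, niter):
    """Loop version that stores every row, in the dtype of `a`."""
    res = np.empty((niter, a.shape[0]), dtype=a.dtype)
    for i in range(niter):
        # Casting the scalar keeps the whole expression in the array's dtype.
        res[i, :] = a * np.cos(a.dtype.type(2.0 * np.pi * i) * b)
    return res


res64 = cos_rows(a64, b64, niter)
res32 = cos_rows(a32, b32, niter)

print(res64.dtype, res64.nbytes, "bytes")
print(res32.dtype, res32.nbytes, "bytes")
print("max abs difference:", np.abs(res64 - res32).max())
```

The float32 result takes half the memory. Whether it is also faster, and by how much, depends on the machine and the cosine implementation, so it is worth timing on your own hardware.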

The second thing to consider is the implementation of the cosine function itself. By default, Python uses whichever implementation it is linked against. In this example I use the Intel SVML implementation. You may have to install it first, as described in the link.

Please also note that it makes no sense to benchmark a function that produces no output. If you do, a compiler like Numba may optimize away the calculation you are trying to benchmark, or the interpreter may try to display the array in the console, which can take a significant amount of time.
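One way to follow this advice is a small timing helper that always keeps the result. This is only a sketch: `time_call` and `loop_version` are hypothetical names introduced here, the array size is reduced so it runs quickly, and `time.perf_counter` is used instead of `time.time` because it is intended for measuring short intervals.

```python
import time

import numpy as np


def time_call(func, *args, repeats=5):
    """Return (best_seconds, result) over several runs, keeping the result."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args)  # the output is stored, not discarded
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best, result


def loop_version(a, b, niter):
    """1D loop from the question, writing each row into a result array."""
    res = np.empty((niter, a.shape[0]))
    for i in range(niter):
        res[i, :] = a * np.cos(2 * np.pi * i * b)
    return res


a = np.arange(20_000, dtype=np.float64)
b = np.arange(20_000, dtype=np.float64)

best, res = time_call(loop_version, a, b, 50)
print("best of 5: %.2f ms, result shape %s" % (best * 1000, res.shape))
```

Taking the best of a few repeats reduces noise from other processes. Returning the array also lets you compare different implementations against each other.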

Code

import numpy as np
import time
import math
import numba as nb

@nb.njit(fastmath=True,parallel=True)
def compute_numba(a,b,it):
  res=np.empty((it,a.shape[0]))
  ita=np.arange(0,it)

  for i in nb.prange(ita.shape[0]):
    it=ita[i]
    for j in range(a.shape[0]):
      res[i,j]=a[j] * np.cos( 2. * np.pi * it * b[j])
  return res

#== Define my two large numpy arrays ==#
#Your input type may be float64?
a = np.arange(200000).astype(np.float64)
b = np.arange(200000).astype(np.float64)


#===============#
#== First way ==#
#===============#

t1 = time.time()

#== Loop that performs 1D array calculation 50 times sequentially ==#
res=np.empty((50,a.shape[0]))
for i in range(0, 50):
    res[i,:]=a * np.cos( 2 * math.pi * i * b )

t2 = time.time()
print( '\nLoop computation with 1D arrays: ', (t2-t1)*1000, ' ms' )


#================#
#== Second way ==#
#================#

t1 = time.time()

#== One liner to use 1D and 2D arrays at once ==#
res=a * np.cos( 2 * math.pi * ( np.arange( 50 ) )[:, None] * b )

t2 = time.time()
print( '\nOne liner using both 1D and 2D arrays at once: ', (t2-t1)*1000, ' ms\n' )

#===============#
#== Third way ==#
#===============#
# Call once first so compilation overhead is not measured (presumably you will call this function many times?)
res=compute_numba(a,b,50)

t1 = time.time()
res=compute_numba(a,b,50)
t2 = time.time()
print( '\nLoop computation with Numba: ', (t2-t1)*1000, ' ms' )

Output

Core i5-8500

Loop computation with 1D arrays:  176.4671802520752  ms
One liner using both 1D and 2D arrays at once:  151.40032768249512  ms
Loop computation with Numba:  26.036739349365234  ms
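The timings above only mean something if all variants produce the same array. Here is a minimal NumPy-only sketch that checks the loop against the one-liner and estimates how much memory the full-size result needs. It uses smaller arrays than the code above so that it runs quickly.

```python
import numpy as np

n = 20_000   # smaller than the 200000 used above, to keep the check quick
niter = 50

a = np.arange(n, dtype=np.float64)
b = np.arange(n, dtype=np.float64)

# Loop version: fill one row per iteration.
res_loop = np.empty((niter, n))
for i in range(niter):
    res_loop[i, :] = a * np.cos(2 * np.pi * i * b)

# Broadcast version: a (niter, 1) column times an (n,) row gives (niter, n).
res_2d = a * np.cos(2 * np.pi * np.arange(niter)[:, None] * b)

print("same shape:", res_loop.shape == res_2d.shape)
print("same values:", np.allclose(res_loop, res_2d))

# Size of the float64 result at the original problem size (50 x 200000).
full_bytes = 50 * 200_000 * np.dtype(np.float64).itemsize
print("full-size result: %.0f MB" % (full_bytes / 1e6))
```

At the original size the result alone is 80 MB. The one-liner also builds full-size temporaries for the argument and for the cosine before the final multiplication. That is one plausible reason it is not much faster than the loop, which only needs row-sized temporaries.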

Answer 2

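There are tools for speeding up loops. I think Numba is the easiest to use. I have heard that Cython is the most effective but harder to use, though I have not tried it myself. Or, in the extreme case, you could write a C extension.

Numba: http://numba.pydata.org
Cython: https://cython.org

Numba example: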

import math

import numpy as np
import numba as nb

@nb.jit(cache=True)
def cos_matrix(a, b, niter):
    for i in range(niter):
        c = a * np.cos(2 * math.pi * i * b)
        # do something with c...
    return c

This generates and compiles C code the first time the function is called.

Edit: as @max9111 pointed out, it is not C code but LLVM IR.
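To see the one-time compilation cost described above, here is a minimal sketch that times the first and second call of a jitted function separately. It assumes Numba is installed. The `try`/`except` fallback to a no-op decorator is a convenience added for this example only, so the script still runs (without any compilation) when Numba is missing. The arrays are kept small for the same reason.

```python
import time

import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback for this sketch only: without Numba nothing is compiled.
    def njit(func):
        return func


@njit
def cos_matrix(a, b, niter):
    # Explicit loops; the result is stored and returned so the work
    # cannot be optimized away.
    res = np.empty((niter, a.shape[0]))
    for i in range(niter):
        for j in range(a.shape[0]):
            res[i, j] = a[j] * np.cos(2.0 * np.pi * i * b[j])
    return res


a = np.arange(2_000, dtype=np.float64)
b = np.arange(2_000, dtype=np.float64)

t0 = time.perf_counter()
first = cos_matrix(a, b, 5)    # with Numba, this call includes compilation
t1 = time.perf_counter()
second = cos_matrix(a, b, 5)   # reuses the already compiled code
t2 = time.perf_counter()

print("first call:  %.3f ms" % ((t1 - t0) * 1000))
print("second call: %.3f ms" % ((t2 - t1) * 1000))
```

With Numba available, the first call is usually much slower than the second, which is why a benchmark should exclude it or call the function once beforehand.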

Python / Numpy performance of looped 1D array calculation vs. 2D array calculation
