Question

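In my Python code I need to loop about 25 million times, and I want to optimize it as much as possible. The operations inside the loop are very simple. To make the code efficient I used the numba module, which helps a lot, but if possible I would like to optimize the code further.

Here is a complete working example: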

```
import numba as nb
import numpy as np
import time 
#######create some synthetic data for illustration purpose##################
size=5000
eps = 0.2
theta_c = 0.4
temp = np.ones(size)
neighbour = np.random.randint(size, size=(size, 3)) 
coschi = np.random.random_sample((size))
theta = np.random.random_sample((size))*np.pi/2
pwr = np.cos(theta)
###################end of dummy data##########################

###################-----main loop------###############
@nb.jit(fastmath=True)
def func(theta, pwr, neighbour, coschi, temp):
    for k in range(np.argmax(pwr), 5000*(pwr.size)): 
        n = k%pwr.size
        if (np.abs(theta[n]-np.pi/2.)<np.abs(theta_c)):
                adj = neighbour[n,1]
        else:
                adj = neighbour[n,0]
        psi_diff = np.abs(np.arccos(coschi[adj])-np.arccos(coschi[n]))
        temp5 = temp[adj]**5;
        e_temp = 1.- np.exp(-temp5*psi_diff/np.abs(eps))
        temp[n] = temp[adj] + (e_temp)/temp5*(pwr[n] - temp[adj]**4)
    return temp

#check time
time1 = time.time()
temp = func(theta, pwr, neighbour, coschi, temp)
print("Took: ", time.time()-time1, " seconds.")
```

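This takes 3.49 seconds on my computer.

I need to run this code thousands of times for a model-fitting task, so even a 1 second improvement would save me tens of hours.

How can I optimize this code further?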

Answer 1

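Let me start with some general comments:

If you use numba and really care about performance, you should avoid any possibility that numba creates object-mode code. That means you should use numba.njit(...) or numba.jit(nopython=True, ...) instead of numba.jit(...).

That makes no difference in your case, but it makes the intent clearer and raises an exception as soon as something isn't supported in the (fast) nopython mode.

You should be careful about what you time and how. The first call of a numba-jitted function (that isn't ahead-of-time compiled) includes the compilation cost. So you need to run it once before you time it to get accurate timings. For more accurate timings you should also call the function more than once. I like IPython's %timeit in Jupyter Notebooks / Lab to get a rough idea of the performance.

So I will use: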
```
res1 = func(theta, pwr, neighbour, coschi, np.ones(size))
res2 = # other approach

np.testing.assert_allclose(res1, res2)

%timeit func(theta, pwr, neighbour, coschi, np.ones(size))
%timeit # other approach
```
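That way I use the first call (which includes the compilation time) together with an assert to make sure it really does produce (almost) the same output, and then time the function with %timeit, which is more robust than `time`.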
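Outside IPython, roughly the same warm-up-then-measure pattern can be sketched with the standard-library timeit module. The helper name `time_after_warmup` and the stand-in `workload` function are hypothetical and only for illustration; any callable, including a numba-jitted one, could be passed instead.

```python
import timeit


def time_after_warmup(fn, *args, repeat=5, number=1):
    """Return the best per-call time in seconds, excluding the first call."""
    # Warm-up call: for a numba-jitted function this is where compilation happens.
    fn(*args)
    # Each entry is the total time for `number` calls; keep the fastest run.
    runs = timeit.repeat(lambda: fn(*args), repeat=repeat, number=number)
    return min(runs) / number


def workload(n):
    # Hypothetical stand-in for the jitted function being measured.
    return sum(i * i for i in range(n))


if __name__ == "__main__":
    best = time_after_warmup(workload, 100_000)
    print(f"best of 5: {best:.6f} s per call")
```

Taking the minimum over several repeats is one common choice; %timeit reports mean and standard deviation instead, so the numbers are not directly comparable.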

Hoisting `np.arccos`

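Now let's start with some actual performance optimizations. An obvious one is that you can hoist some "invariants": for example, np.arccos(coschi[...]) is calculated far more often than there are actual elements in coschi. You iterate over each element in coschi roughly 5000 times and compute two np.arccos per iteration! So let's calculate the arccos of coschi once and store it in an intermediate array that can be accessed inside the loop: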

```
@nb.njit(fastmath=True)
def func2(theta, pwr, neighbour, coschi, temp):
    arccos_coschi = np.arccos(coschi)
    for k in range(np.argmax(pwr), 5000 * pwr.size): 
        n = k % pwr.size
        if np.abs(theta[n] - np.pi / 2.) < np.abs(theta_c):
            adj = neighbour[n, 1]
        else:
            adj = neighbour[n, 0]
        psi_diff = np.abs(arccos_coschi[adj] - arccos_coschi[n])
        temp5 = temp[adj]**5
        e_temp = 1. - np.exp(-temp5 * psi_diff / np.abs(eps))
        temp[n] = temp[adj] + e_temp / temp5 * (pwr[n] - temp[adj]**4)
    return temp
```

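On my computer that's already significantly faster:

```
1.73 s ± 54.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  # original
811 ms ± 49.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  # func2
```

However, it comes at a price: the results will differ! With fastmath=True I consistently get significantly different results between the original and the hoisted version. With fastmath=False the results are (almost) equal. It seems that fastmath enables some aggressive optimizations on np.arccos(coschi[adj]) - np.arccos(coschi[n]) that aren't possible once np.arccos is hoisted. In my personal opinion, I would leave out fastmath=True if you care about exact results, unless you have tested that the accuracy of the result isn't significantly affected by fastmath!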
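One general reason fastmath can change results is that it lets the compiler treat floating-point arithmetic as if it were associative, which it is not. The plain-Python sketch below (no numba involved, values chosen purely for illustration) shows how regrouping the same three operands gives a different answer. It does not reproduce the specific transformation applied to the arccos expression above.

```python
a, b, c = 1e16, 1.0, -1e16

# Evaluate left to right: 1.0 is too small to survive being added to 1e16.
left_to_right = (a + b) + c

# Regroup so the two large terms cancel before 1.0 is added.
regrouped = (a + c) + b

print(left_to_right, regrouped)
print(left_to_right == regrouped)
```

Comparing the two variants with np.testing.assert_allclose, as done earlier in this answer, is a practical way to decide whether differences of this kind matter for a given model.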

Hoisting `adj`

The next thing to hoist is adj, which is also calculated far more often than necessary:

```
@nb.njit(fastmath=True)
def func3(theta, pwr, neighbour, coschi, temp):
    arccos_coschi = np.arccos(coschi)
    associated_neighbour = np.empty(neighbour.shape[0], nb.int64)
    for idx in range(neighbour.shape[0]):
        if np.abs(theta[idx] - np.pi / 2.) < np.abs(theta_c):
            associated_neighbour[idx] = neighbour[idx, 1]
        else:
            associated_neighbour[idx] = neighbour[idx, 0]

    for k in range(np.argmax(pwr), 5000 * pwr.size): 
        n = k % pwr.size
        adj = associated_neighbour[n]
        psi_diff = np.abs(arccos_coschi[adj] - arccos_coschi[n])
        temp5 = temp[adj]**5
        e_temp = 1. - np.exp(-temp5 * psi_diff / np.abs(eps))
        temp[n] = temp[adj] + e_temp / temp5 * (pwr[n] - temp[adj]**4)
    return temp
```
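The selection loop at the top of func3 could also be written as a single vectorized NumPy expression outside the jitted function. This is only a sketch, assuming neighbour is a 2-D integer array and theta a 1-D float array of matching length, as in the synthetic data; the function name `select_neighbour` is hypothetical, and whether it is faster than the compiled loop would have to be measured.

```python
import numpy as np


def select_neighbour(theta, neighbour, theta_c):
    """Pick column 1 where |theta - pi/2| < |theta_c|, otherwise column 0."""
    mask = np.abs(theta - np.pi / 2.0) < abs(theta_c)
    return np.where(mask, neighbour[:, 1], neighbour[:, 0])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    size = 8
    theta = rng.random(size) * np.pi / 2
    neighbour = rng.integers(0, size, size=(size, 3))
    print(select_neighbour(theta, neighbour, 0.4))
```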

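The effect of hoisting adj isn't that big, but it is measurable:

```
1.75 s ± 110 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  # original
761 ms ± 28.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) # func2
660 ms ± 8.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) # func3
```

Hoisting additional calculations seemed to have no effect on the performance on my computer, so I didn't include them here. This seems to be as far as you can get without changing the algorithm.

Refactoring into smaller functions (+ smaller additional changes)

However, I would recommend separating the hoisting into other functions and making all variables function parameters instead of looking up global variables. That probably won't result in speed-ups, but it can make the code more readable: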

```
@nb.njit
def func4_inner(indices, pwr, associated_neighbour, arccos_coschi, temp, abs_eps):
    for n in indices:
        adj = associated_neighbour[n]
        psi_diff = np.abs(arccos_coschi[adj] - arccos_coschi[n])
        temp5 = temp[adj]**5
        e_temp = 1. - np.exp(-temp5 * psi_diff / abs_eps)
        temp[n] = temp[adj] + e_temp / temp5 * (pwr[n] - temp[adj]**4)
    return temp

@nb.njit
def get_relevant_neighbor(neighbour, abs_theta_minus_pi_half, abs_theta_c):
    associated_neighbour = np.empty(neighbour.shape[0], nb.int64)
    for idx in range(neighbour.shape[0]):
        if abs_theta_minus_pi_half[idx] < abs_theta_c:
            associated_neighbour[idx] = neighbour[idx, 1]
        else:
            associated_neighbour[idx] = neighbour[idx, 0]
    return associated_neighbour

def func4(theta, pwr, neighbour, coschi, temp, theta_c, eps):
    arccos_coschi = np.arccos(coschi)
    abs_theta_minus_pi_half = np.abs(theta - (np.pi / 2.))
    relevant_neighbors = get_relevant_neighbor(neighbour, abs_theta_minus_pi_half, abs(theta_c))
    argmax_pwr = np.argmax(pwr)
    indices = np.tile(np.arange(pwr.size), 5000)[argmax_pwr:]
    return func4_inner(indices, pwr, relevant_neighbors, arccos_coschi, temp, abs(eps))
```

Here I also made some additional changes:

- Calculated the indices beforehand using np.tile and slicing instead of the range approach together with %.
- Used plain NumPy (outside of numba) to calculate the np.arccos.
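The index precomputation can be checked on a small case: tiling np.arange and slicing off the first entries should give the same sequence as k % size for k in range(start, repeats * size). The sketch below uses small made-up values for size, repeats and start in place of the real array size, 5000 and np.argmax(pwr).

```python
import numpy as np

size, repeats, start = 4, 3, 2  # illustrative values only

# Approach used in func4: build all indices up front.
tiled = np.tile(np.arange(size), repeats)[start:]

# Approach used in the original loop: modulo on a running counter.
modulo = np.array([k % size for k in range(start, repeats * size)])

print(tiled)
print(np.array_equal(tiled, modulo))
```

One trade-off of this design is memory: with 5000 elements repeated 5000 times the index array holds 25 million entries, which is roughly 200 MB when stored as 64-bit integers.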

Final timings and summary

```
1.79 s ± 49.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  # original
844 ms ± 41.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  # func2
707 ms ± 31.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  # func3
550 ms ± 4.88 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  # func4
```

So in the end the latest approach is about 3 times faster (without fastmath) than the original approach. If you are sure that you want to use fastmath, just apply fastmath=True to func4_inner and it will be even faster:

```
499 ms ± 4.47 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)  # func4 with fastmath on func4_inner
```

However, as I already stated, fastmath might not be appropriate if you want exact (or at least not too inexact) results.

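Several of the optimizations here also depend a lot on the available hardware and processor caches (especially for the memory-bandwidth-limited parts of the code). You have to check how these approaches perform relative to each other on your computer.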

Answer 2

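Numba is really great. But if you are desperate, remember that you can always write in C (youtube). For my own problem I got about 30% better performance than numba just by translating to C line by line.

If you want to spend the effort, I would recommend using eigen for the vector operations (with vector sizes known at compile time) and pybind11, because it converts natively between numpy and eigen. Of course, keep the main loop in Python. Make sure to use appropriate compiler flags (such as -O3 -march=native, -mtune=native, -ffast-math) and try different compilers (for me the gcc output is faster than clang's, but colleagues have reported the opposite).

If you don't know any C++, it may be smarter to restrict yourself to pure C without any libraries, because that keeps the complexity down. But then you will be dealing with the Python and numpy C APIs directly (not that hard, but it takes a lot more code, and you will learn all about Python's internals).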

Answer 3

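For starters, you can use the standard library math functions instead of numpy for some operations. These changes alone took the runtime from 2.2 s down to 1.9 s on my computer.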

```
import numba as nb
import numpy as np
import math
import time 
#######create some synthetic data for illustration purpose##################
size=5000
eps = 0.2
theta_c = 0.4
temp = np.ones(size)
neighbour = np.random.randint(size, size=(size, 3)) 
coschi = np.random.random_sample((size))
theta = np.random.random_sample((size))*np.pi/2
pwr = np.cos(theta)
###################end of dummy data##########################

###################-----main loop------###############
@nb.jit(fastmath=True)
def func(theta, pwr, neighbour, coschi, temp):
    for k in range(np.argmax(pwr), 5000*(pwr.size)): 
        n = k%pwr.size
        #taking into account regions with different super wind direction
        if (abs(theta[n]-math.pi/2.)<abs(theta_c)):
                adj = neighbour[n,1]
        else:
                adj = neighbour[n,0]
        psi_diff = abs(math.acos(coschi[adj])-math.acos(coschi[n]))
        temp5 = temp[adj]**5;
        e_temp = 1.- math.exp(-temp5*psi_diff/abs(eps))
        temp[n] = temp[adj] + (e_temp)/temp5*(pwr[n] - temp[adj]**4)
    return temp

#check time
time1 = time.time()
temp = func(theta, pwr, neighbour, coschi, temp)
print("Took: ", time.time()-time1, " seconds.")
```
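In plain Python (outside numba) the two families are not fully interchangeable: the math functions accept only scalars and raise on out-of-domain input, while the NumPy ones return nan. The sketch below shows that difference, which may be worth keeping in mind if coschi could ever leave [-1, 1]. How numba-compiled code handles the same case is not shown here and would need to be checked separately.

```python
import math

import numpy as np

x = 0.5
# Inside the domain both give the same value (up to floating-point rounding).
print(math.isclose(math.acos(x), float(np.arccos(x))))

# Outside [-1, 1], math.acos raises a ValueError ...
try:
    math.acos(2.0)
except ValueError as exc:
    print("math.acos raised:", exc)

# ... while np.arccos returns nan (the warning is silenced here on purpose).
with np.errstate(invalid="ignore"):
    print("np.arccos returned:", np.arccos(2.0))
```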

Answer 4

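In your example it looks like you are handling a lot of duplicates.

In this version I don't recalculate any values for an 'n' that has already been seen.

I don't know whether that is acceptable in your case, but it saves me about 0.4 seconds.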

```
#!/usr/bin/env python


import numba as nb
import numpy as np
import time
#######create some synthetic data for illustration purpose##################
size = 5000
eps = 0.2
theta_c = 0.4
temp = np.ones(size)
neighbour = np.random.randint(size, size=(size, 3))
coschi = np.random.random_sample((size))
theta = np.random.random_sample((size))*np.pi/2
pwr = np.cos(theta)
###################end of dummy data##########################

###################-----main loop------###############
@nb.jit(fastmath=True)
def func(theta, pwr, neighbour, coschi, temp):

    hashtable = {}

    for k in range(np.argmax(pwr), 5000*(pwr.size)):
        n = k % pwr.size

        if not hashtable.get(n, False):
            hashtable[n] = 1

            #taking into account regions with different super wind direction
            if (np.abs(theta[n]-np.pi/2.) < np.abs(theta_c)):
                adj = neighbour[n, 1]
            else:
                adj = neighbour[n, 0]
            psi_diff = np.abs(np.arccos(coschi[adj])-np.arccos(coschi[n]))
            temp5 = temp[adj]**5
            e_temp = 1. - np.exp(-temp5*psi_diff/np.abs(eps))
            retval = temp[adj] + (e_temp)/temp5*(pwr[n] - temp[adj]**4)

            temp[n] = retval

    return temp


#check time
time1 = time.time()
result = func(theta, pwr, neighbour, coschi, temp)
print("Took: ", time.time()-time1, " seconds.")
```

Timings in seconds:

Original : Hashtable

2.3726098537445070 : 1.8722639083862305

2.3447792530059814 : 1.9053585529327393

2.3363733291625977 : 1.9104151725769043

2.3447978496551514 : 1.9298338890075684

2.4740016460418700 : 1.9088914394378662

A bare loop over the 25 million items using np.ones:

Took: 1.252222061157227 seconds for 25000000 items

Took: 1.294729232788086 seconds for 25000000 items

Took: 1.2670648097991943 seconds for 25000000 items

Took: 1.2386720180511475 seconds for 25000000 items

Took: 1.2517566680908203 seconds for 25000000 items
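Because n cycles through the same indices over and over, skipping already-seen values means that only the first visit to each index does any work. The sketch below counts those first visits, using a boolean NumPy array in place of the dictionary; the function name `count_first_visits` and the small parameters are hypothetical. Note that this is a different computation from the original loop, which updates temp[n] on every pass.

```python
import numpy as np


def count_first_visits(start, size, repeats):
    """Count how many loop iterations see an index n for the first time."""
    seen = np.zeros(size, dtype=np.bool_)
    first_visits = 0
    for k in range(start, repeats * size):
        n = k % size
        if not seen[n]:
            seen[n] = True
            first_visits += 1
    return first_visits


if __name__ == "__main__":
    total_iterations = 5 * 10 - 3
    print(count_first_visits(3, 10, 5), "of", total_iterations, "iterations do work")
```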

