Question

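I am writing a scientific application in Python with a very processor-intensive loop at its core. I would like to optimise this as far as possible with minimum inconvenience to end users, who will probably use it as an uncompiled collection of Python scripts, and who will be running Windows, Mac, and (mainly Ubuntu) Linux.

It is currently written in Python with a dash of NumPy, and I have included the code below.

Is there a reasonably fast solution that does not require compiling? That seems the easiest way to maintain platform independence.
If something like Pyrex, which does require compilation, is used, is there an easy way to bundle many modules and have Python choose between them depending on the detected OS and Python version? Is there an easy way to build the collection of modules without needing access to every system with every version of Python?
Does one method lend itself particularly well to multi-processor optimisation?

(If you're interested, the loop calculates the magnetic field at a given point inside a crystal by adding together the contributions of a large number of nearby magnetic ions, each treated as a tiny bar magnet. Basically, it is a massive sum of these dipole fields.)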

# calculate_dipole
# -------------------------
# calculate_dipole works out the dipole field at a given point within the crystal unit cell
# ---
# INPUT
# mu = position at which to calculate the dipole field
# r_i = array of atomic positions
# mom_i = corresponding array of magnetic moments
# ---
# OUTPUT
# B = the B-field at this point

from numpy import dot, sqrt, zeros

# unit_vectors() is not shown here; from its use below it is expected to
# return each row of `relative` normalised to unit length.

def calculate_dipole(mu, r_i, mom_i):
    relative = mu - r_i
    r_unit = unit_vectors(relative)
    # mu0 / (4*pi), the prefactor of the dipole equation (SI units)
    A = 1e-7
    # initialise dipole field
    B = zeros(3,float)

    for i in range(len(relative)):
        # work out the dipole field and add it to the estimate so far
        B += A*(3*dot(mom_i[i],r_unit[i])*r_unit[i] - mom_i[i]) / sqrt(dot(relative[i],relative[i]))**3
    return B

Answer 1

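If you eliminate the loop and use NumPy's vectorized operations, you can make this run much, much faster. Put your data into NumPy arrays of shape (3, N) and try the following: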

import numpy as np

N = 20000
mu = np.random.random((3,1))
r_i = np.random.random((3,N))
mom_i = np.random.random((3,N))

def unit_vectors(r):
    return r / np.sqrt((r*r).sum(0))

def calculate_dipole(mu, r_i, mom_i):
    relative = mu - r_i
    r_unit = unit_vectors(relative)
    A = 1e-7

    num = A*(3*np.sum(mom_i*r_unit, 0)*r_unit - mom_i)
    den = np.sqrt(np.sum(relative*relative, 0))**3
    B = np.sum(num/den, 1)
    return B

This is about 50 times faster than using the for loop.
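A quick way to convince yourself that the vectorized form computes the same sum is to compare it against an explicit loop on random data. The sketch below is my own check, not the answerer's code; the names `dipole_loop` and `dipole_vectorized`, and the offset that keeps the test point away from the dipoles, are assumptions made for illustration.

```python
import numpy as np

A = 1e-7  # mu0 / (4*pi) in SI units


def dipole_loop(mu, r_i, mom_i):
    """Reference implementation: explicit loop, arrays of shape (N, 3)."""
    B = np.zeros(3)
    for r, m in zip(mu - r_i, mom_i):
        dist = np.sqrt(np.dot(r, r))
        r_hat = r / dist
        B += A * (3 * np.dot(m, r_hat) * r_hat - m) / dist**3
    return B


def dipole_vectorized(mu, r_i, mom_i):
    """Vectorized form, arrays of shape (3, N); mu has shape (3, 1)."""
    relative = mu - r_i                       # broadcasts mu across all N columns
    dist = np.sqrt(np.sum(relative * relative, 0))
    r_unit = relative / dist
    num = A * (3 * np.sum(mom_i * r_unit, 0) * r_unit - mom_i)
    return np.sum(num / dist**3, 1)           # sum over the N dipoles


rng = np.random.default_rng(0)
N = 2000
mu = rng.random((3, 1)) + 2.0   # assumption: test point outside the unit cube
r_i = rng.random((3, N))
mom_i = rng.random((3, N))

B_loop = dipole_loop(mu.T, r_i.T, mom_i.T)    # loop version wants (N, 3)
B_vec = dipole_vectorized(mu, r_i, mom_i)
print(np.allclose(B_loop, B_vec))
```

The two versions add the terms in a different order, so they should agree to floating-point tolerance rather than bit for bit.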

Answer 2

NumPy does use some native optimization for array processing. You can use NumPy arrays together with Cython to gain some speedups.

Answer 3

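Your Python code could probably be sped up a bit by replacing your loop with a generator expression and removing all the lookups of mom_i[i], relative[i] and r_unit[i] by iterating through all three sequences in parallel using itertools.izip.

i.e. replace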

B = zeros(3,float)

for i in range(len(relative)):
    #work out the dipole field and add it to the estimate so far
    B += A*(3*dot(mom_i[i],r_unit[i])*r_unit[i] - mom_i[i]) / sqrt(dot(relative[i],relative[i]))**3
return B

with:

from itertools import izip
...
return sum((A*(3*dot(mom,ru)*ru - mom) / sqrt(dot(rel,rel))**3 
            for mom, ru, rel in izip(mom_i, r_unit, relative)),
           zeros(3,float))

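This is also more readable IMHO, since the core equation is not cluttered with [i] everywhere.

I suspect, however, that this will bring only marginal gains compared with doing the whole function in a compiled language such as Cython.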
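Note that itertools.izip only exists on Python 2. On Python 3 the built-in zip is already lazy, so the same idea can be sketched as follows. The function name `calculate_dipole_zip` and the random test data are assumptions for illustration, and the arrays are taken to have shape (N, 3) as in the question.

```python
import numpy as np

A = 1e-7  # mu0 / (4*pi) in SI units


def calculate_dipole_zip(mom_i, r_unit, relative):
    # Built-in zip is lazy on Python 3, like itertools.izip on Python 2.
    terms = (A * (3 * np.dot(mom, ru) * ru - mom) / np.sqrt(np.dot(rel, rel))**3
             for mom, ru, rel in zip(mom_i, r_unit, relative))
    # The explicit start value makes sum() add arrays instead of starting at int 0.
    return sum(terms, np.zeros(3))


rng = np.random.default_rng(1)
relative = rng.random((500, 3)) + 1.0                 # keep distances away from zero
r_unit = relative / np.sqrt((relative**2).sum(1))[:, None]
mom_i = rng.random((500, 3))

B_zip = calculate_dipole_zip(mom_i, r_unit, relative)
print(B_zip)
```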

Answer 4

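One simple, but significant, speedup is to take the multiplication by A outside of your sum. You can just multiply B by it once when you return it: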

for i in range(len(relative)):
    #work out the dipole field and add it to the estimate so far
    B += (3*dot(mom_i[i],r_unit[i])*r_unit[i] - mom_i[i]) / sqrt(dot(relative[i],relative[i]))**3

return A*B

With 20,000 random dipoles, this gave a speedup of about 8%.
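If you want to measure this on your own machine, a small timeit harness along these lines should do. It is only a sketch: the function names `field_inner` and `field_hoisted`, the array size, and the repeat counts are my assumptions, and the gain you see will depend on your hardware and NumPy version.

```python
import timeit
import numpy as np
from numpy import dot, sqrt, zeros

A = 1e-7  # mu0 / (4*pi) in SI units


def field_inner(mom_i, r_unit, relative):
    """A is applied to every term inside the loop."""
    B = zeros(3)
    for i in range(len(relative)):
        B += A * (3*dot(mom_i[i], r_unit[i])*r_unit[i] - mom_i[i]) / sqrt(dot(relative[i], relative[i]))**3
    return B


def field_hoisted(mom_i, r_unit, relative):
    """A is applied once, after the loop."""
    B = zeros(3)
    for i in range(len(relative)):
        B += (3*dot(mom_i[i], r_unit[i])*r_unit[i] - mom_i[i]) / sqrt(dot(relative[i], relative[i]))**3
    return A * B


rng = np.random.default_rng(2)
relative = rng.random((1000, 3)) + 1.0
r_unit = relative / np.sqrt((relative**2).sum(1))[:, None]
mom_i = rng.random((1000, 3))

# Take the best of a few repeats to reduce noise from other processes.
t_inner = min(timeit.repeat(lambda: field_inner(mom_i, r_unit, relative), number=3, repeat=3))
t_hoisted = min(timeit.repeat(lambda: field_hoisted(mom_i, r_unit, relative), number=3, repeat=3))
print("inner: %.4f s, hoisted: %.4f s" % (t_inner, t_hoisted))
```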

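Beyond that easy speedup, I would recommend using Cython (which is generally recommended over Pyrex) or weave from SciPy. Have a look at Performance Python for some examples and comparisons of various ways to speed up NumPy/SciPy.

If you want to try to make this parallel, I would recommend taking a look at SciPy's Parallel Programming page to get started.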
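Because the field is a plain sum over dipoles, one simple way to parallelize is to split the dipoles into chunks, compute a partial field for each chunk, and add the partial results. Here is a minimal sketch using the standard-library concurrent.futures module. The helper names `partial_field` and `calculate_dipole_chunked` and the chunk count are my assumptions. Whether threads actually help depends on how much time NumPy spends outside the GIL for your array sizes; ProcessPoolExecutor offers the same map interface, but needs picklable arguments and, on some platforms, an `if __name__ == "__main__":` guard.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

A = 1e-7  # mu0 / (4*pi) in SI units


def partial_field(mu, r_chunk, mom_chunk):
    """Unscaled field at mu from one chunk of dipoles; arrays have shape (3, n)."""
    relative = mu - r_chunk
    dist = np.sqrt(np.sum(relative * relative, 0))
    r_unit = relative / dist
    return np.sum((3 * np.sum(mom_chunk * r_unit, 0) * r_unit - mom_chunk) / dist**3, 1)


def calculate_dipole_chunked(mu, r_i, mom_i, n_chunks=4):
    # Split along the dipole axis; each chunk is an independent partial sum.
    r_chunks = np.array_split(r_i, n_chunks, axis=1)
    mom_chunks = np.array_split(mom_i, n_chunks, axis=1)
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        parts = pool.map(partial_field, [mu] * n_chunks, r_chunks, mom_chunks)
        return A * sum(parts, np.zeros(3))


rng = np.random.default_rng(3)
N = 20000
mu = rng.random((3, 1)) + 2.0   # assumption: test point away from the dipoles
r_i = rng.random((3, N))
mom_i = rng.random((3, N))

B_chunked = calculate_dipole_chunked(mu, r_i, mom_i)
print(B_chunked)
```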

It's nice to see another physicist on SO. There aren't very many on here.

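Edit:

I decided to take this as a challenge to develop some Cython skills, and got about a 10x time improvement over a Psyco-optimized version. Let me know if you would like to see my code.

Edit 2:

OK, I went back and found what was slowing things down in my Cython version. The speedup is now well over 100x. If you want or need another factor of 2 or so over Ray's sped-up NumPy version, let me know and I will post my code.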

Cython source code:

Here is the Cython code that I whipped up:

import numpy as np
cimport numpy as np
cimport cython

cdef extern from "math.h":
    double sqrt(double theta)

ctypedef np.float64_t dtype_t

@cython.boundscheck(False)
@cython.wraparound(False)
def calculate_dipole_cython(np.ndarray[dtype_t,ndim=2,mode="c"] mu,
                            np.ndarray[dtype_t,ndim=2,mode="c"] r_i,
                            np.ndarray[dtype_t,ndim=2,mode="c"] mom_i):
    cdef Py_ssize_t i
    cdef np.ndarray[dtype_t,ndim=1,mode="c"] tmp = np.empty(3,np.float64)
    cdef np.ndarray[dtype_t,ndim=1,mode="c"] relative = np.empty(3,np.float64)
    cdef double A = 1e-7
    cdef double C, D, F
    cdef np.ndarray[dtype_t,ndim=1,mode="c"] B = np.zeros(3,np.float64)
    for i in xrange(r_i.shape[0]):
        relative[0] = mu[0,0] - r_i[i,0]
        relative[1] = mu[0,1] - r_i[i,1]
        relative[2] = mu[0,2] - r_i[i,2]
        C = relative[0]*relative[0] + relative[1]*relative[1] + relative[2]*relative[2]
        C = 1.0/sqrt(C)
        D = C**3
        tmp[0] = relative[0]*C
        F = mom_i[i,0]*tmp[0]
        tmp[1] = relative[1]*C
        F += mom_i[i,1]*tmp[1]
        tmp[2] = relative[2]*C
        F += mom_i[i,2]*tmp[2]
        F *= 3
        B[0] += (F*tmp[0] - mom_i[i,0])*D
        B[1] += (F*tmp[1] - mom_i[i,1])*D
        B[2] += (F*tmp[2] - mom_i[i,2])*D
    return A*B

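I think I have optimized it quite a bit, but there may be a little more you can squeeze out of it. You could still replace np.zeros and np.empty with direct calls to the NumPy C API, but that should not make much of a difference. As it stands, this code gives a 2-3x improvement over the NumPy-optimized code you have. However, you need to pass the arrays in correctly. They need to be in C format (which is the default for NumPy arrays, but in NumPy the transpose of a C-formatted array is a Fortran-formatted array).

For example, to run the code from your other question, you will need to replace np.random.random((3,N)) with np.random.random((N,3)). Also,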

r_test_fast = reshape_vector(r_test)

needs to be changed to

r_test_fast = np.array(np.matrix(r_test))

This last line could be made simpler/faster, but in my opinion that would be premature optimization.
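To make the layout requirement concrete, here is a small sketch of how you could inspect and fix the memory order and shape of your arrays before calling the Cython function. It assumes `r_test` is a 1-D position of length 3 (I am guessing that from the call above, since `reshape_vector` is not shown here). For a 1-D input, np.atleast_2d produces the same (1, 3) shape as np.array(np.matrix(r_test)).

```python
import numpy as np

N = 5
r_i = np.random.random((3, N))            # layout used by the vectorized NumPy version
r_t = r_i.T                               # shape (N, 3), but only a Fortran-ordered view
r_c = np.ascontiguousarray(r_i.T)         # shape (N, 3), copied into C order

print(r_i.flags['C_CONTIGUOUS'])          # freshly created arrays are C-ordered
print(r_t.flags['C_CONTIGUOUS'], r_t.flags['F_CONTIGUOUS'])
print(r_c.flags['C_CONTIGUOUS'])

r_test = np.array([0.1, 0.2, 0.3])        # assumption: a 1-D position vector
r_test_fast = np.atleast_2d(r_test)       # 2-D array of shape (1, 3), as mu[0, k] expects
print(r_test_fast.shape)
```

Creating the data directly with shape (N, 3), as suggested above, avoids the extra copy that np.ascontiguousarray makes.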

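If you have not used Cython before and do not know how to compile this, let me know and I will be happy to help.

Finally, I would recommend looking at this paper. I used it as a guide for my optimizations. The next step would be to try BLAS functions that make use of the SSE2 instruction set, to try using the SSE API directly, or to use more of the NumPy C API that interfaces with SSE2. You may also want to consider parallelizing.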

Answer 5

Python is not intended for high-performance computation. Write the core loop in C and call it from Python.

What is the most platform- and Python-version-independent way to make a fast loop for use in Python?
