Best-performance computation of Newtonian forces in numpy/scipy

Date: 2015-02-17 00:23:15

Tags: python arrays performance numpy scipy

For a university exercise we had to implement a Leapfrog integrator with exact Newtonian forces in Python. The course is over and our solution was good enough, but I am wondering whether/how the performance of the force calculation could be improved further.

The bottleneck is the calculation of all the forces (i.e. accelerations):

a_i = Σ_{j≠i} Gm_j / |r_1 − r_2|³ · (r_1 − r_2)

for large numbers N of particles (1000 and up), with i, j < N.

Here r_1 and r_2 are the 3-dimensional position vectors of the particles, stored in an ndarray of shape (N, 3), and Gm is the particle mass times the gravitational constant, which I keep in an ndarray of shape (N,).
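For reference, a direct double-loop translation of this formula (a hypothetical `a_naive`, written for clarity rather than speed) would look like this:

import numpy as np

def a_naive(r, Gm):
    """Reference implementation: loop over all pairs -- O(N**2), slow."""
    N = r.shape[0]
    a = np.zeros_like(r)
    for i in range(N):
        for j in range(N):
            if i == j:
                continue  # a particle exerts no force on itself
            d = r[j] - r[i]
            a[i] += Gm[j] * d / np.linalg.norm(d)**3
    return a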

The fastest version I have found so far is the following:

# needs: import numpy as np; from scipy.spatial.distance import cdist
def a(self):
    # pairwise separations: sep[i, j] = r_j - r_i
    sep = self.r[np.newaxis, :] - self.r[:, np.newaxis]
    # Calculate the distances between all particles with cdist
    # this is much faster than by hand
    dists = cdist(self.r, self.r)
    scale = dists * dists * dists
    # set diagonal elements of scale to something != 0, to avoid division by 0
    np.fill_diagonal(scale, 1)
    Fsum = (sep / scale.reshape(self.particlenr, self.particlenr, 1)) * self.Gm[:, None]
    return np.add.reduce(Fsum, axis=1)

But it strikes me that this is probably not the fastest version: the first line seems too slow compared to cdist, even though cdist does essentially the same computation. Also, this solution does not use the problem's symmetry under swapping r_1 and r_2, and so computes every element twice.

Do you have any ideas for performance improvements (without changing the force calculation to some kind of approximation or switching programming languages)?

3 answers:

Answer 0 (score: 1):

I'll give it a try: I implemented a routine that determines a single a_i:

import numpy as np

GM = .01  # particle mass times the gravitational constant

def calc_a_i(rr, i):
    """ Calculate one a_i """
    drr = rr - rr[i, :] # r_j - r_i
    dr3 = np.linalg.norm(drr, axis=1)**3  # |r_j - r_i|**3
    dr3[i] = 1  # case i==j: drr = [0, 0, 0]
    # this would be more robust (eliminate small denominators):
    # dr3 = np.where(np.abs(dr3) > 1e-12, dr3, 1)
    return np.sum(drr.T/dr3, axis=1)

n = 4000 # number of particles
rr = np.random.randn(n, 3) # generate some particles

# Calculate each a_i separately:
aa = np.array([calc_a_i(rr, i) for i in range(n)]) * GM # all a_i

To test it, I ran:

In [1]: %timeit aa = np.array([calc_a_i(rr, i) for i in range(n)])
1 loops, best of 3: 2.93 s per loop

The simplest way to speed up code like this is to use numexpr to evaluate the array expressions faster:

import numexpr as ne
ne.set_num_threads(1)  # multithreading causes too much overhead

def ne_calc_a_i(i):
    """ Use numexpr - here rr is global for easier parallelization"""
    dr1, dr2, dr3 = (rr - rr[i, :]).T # r_j - r_i
    drrp3 = ne.evaluate("sqrt(dr1**2 + dr2**2 + dr3**2)**3")
    drrp3[i] = 1
    return np.sum(np.vstack([dr1, dr2, dr3])/drrp3, axis=1)

# Calculate each a_i separately:
aa_ne = np.array([ne_calc_a_i(i) for i in range(n)]) * GM  # all a_i    

This speeds it up by a factor of about two:

In [2]: %timeit aa_ne = np.array([ne_calc_a_i(i) for i in range(n)])
1 loops, best of 3: 1.29 s per loop

To speed the code up even more, run it on an IPython Cluster:

# Start a local cluster with 4 clients in a shell with:
# ipcluster start -n 4
from IPython.parallel import Client  # on newer installations: from ipyparallel import Client

rc = Client()  # clients of the cluster
dview = rc[:]  # view of clusters

dview.execute("import numpy as np")  # import libraries on clients
dview.execute("import numexpr as ne")
dview.execute("ne.set_num_threads(1)")

def para_calc_a(dview, rr):
    """ Only in function for %timeit """
    # send rr and ne_calc_a_i() to clients:
    dview.push(dict(rr=rr, ne_calc_a_i=ne_calc_a_i), block=True)
    return np.array(dview.map_sync(ne_calc_a_i, range(n)))*GM

This speeds it up by more than a factor of four:

In [3]: %timeit aa_p = para_calc_a(dview, rr)
1 loops, best of 3: 612 ms per loop

As @mathdan has already pointed out, it is not obvious how best to optimize a problem like this: whether the memory bus or the floating-point unit is the limiting factor depends on your CPU architecture, and each case calls for different techniques.

To gain even more, you might want to look at Theano: it can dynamically generate GPU code from Python.
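As a rough illustration of what that could look like, here is a minimal, untested sketch of the same force expression written as a Theano graph (the variable names are my own; targeting the GPU additionally requires the appropriate device flags):

import theano
import theano.tensor as T

r = T.dmatrix('r')    # (N, 3) particle positions
gm = T.dvector('Gm')  # (N,) gravitational constant times masses

sep = r.dimshuffle('x', 0, 1) - r.dimshuffle(0, 'x', 1)  # sep[i, j] = r_j - r_i
d2 = T.sum(sep ** 2, axis=2)                             # squared distances
d3 = T.switch(T.eq(d2, 0.), 1., d2) ** 1.5               # |r_j - r_i|**3, diagonal patched
acc = T.sum(sep * (gm / d3).dimshuffle(0, 1, 'x'), axis=1)

calc_a = theano.function([r, gm], acc)  # Theano compiles the graph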

Answer 1 (score: 1):

The following is more optimized:

import numpy as np
from scipy.spatial.distance import pdist, squareform

def a6(r, Gm):
    dists = pdist(r)                 # condensed distances: each pair only once
    dists *= dists*dists             # cube in place: |r_j - r_i|**3
    dists = squareform(dists)        # expand to the full (N, N) matrix
    np.fill_diagonal(dists, 1.)      # avoid division by zero for i == j
    sep = r[np.newaxis, :] - r[:, np.newaxis]  # sep[i, j] = r_j - r_i
    return np.einsum('ijk,ij->ik', sep, Gm/dists)

The speed gain comes mostly from the einsum line; using pdist and squareform this way is only marginally faster than your original approach with cdist.
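To see what the einsum line does: 'ijk,ij->ik' multiplies sep[i, j, :] by the scalar (Gm/dists)[i, j] and sums over the pair axis j, without materializing the (N, N, 3) temporary. A quick equivalence check with made-up values:

import numpy as np

sep = np.random.randn(6, 6, 3)
w = np.random.randn(6, 6)

a1 = np.einsum('ijk,ij->ik', sep, w)     # contract over j
a2 = (sep * w[:, :, None]).sum(axis=1)   # same result via an explicit temporary
assert np.allclose(a1, a2)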

You can take this a step further, e.g. with threads and Numba (requires version 0.17.0). The code below is rather ugly and can certainly be improved a lot, but it is very fast.

import numpy as np
import math
from numba import jit
from threading import Thread
NUM_THREADS = 2  # choose wisely

def a_numba_par(r, Gm):
    a = np.zeros_like(r)
    N = r.shape[0]

    # split 0..N into NUM_THREADS contiguous chunks (last chunk absorbs the remainder)
    offset = list(range(0, N, N // NUM_THREADS))[:NUM_THREADS] + [N]
    chunks = list(zip(offset, offset[1:]))
    threads = [Thread(target=_numba_loop, args=(r, Gm, a) + c) for c in chunks]

    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    return a

@jit(nopython=True, nogil=True)
def _numba_loop(r, Gm, a, i1, i2):
    N = r.shape[0]
    for i in range(i1, i2):
        _helper(r, Gm, i, 0, i, a[i, :])    # j < i
        _helper(r, Gm, i, i+1, N, a[i, :])  # j > i (skips j == i)
    return a

@jit(nopython=True, nogil=True)
def _helper(r, Gm, i, j1, j2, a):
    for j in range(j1, j2):
        dx = r[j,0] - r[i,0]
        dy = r[j,1] - r[i,1]
        dz = r[j,2] - r[i,2]

        sqeuc = dx*dx + dy*dy + dz*dz               # squared distance
        scale = Gm[j] / (sqeuc * math.sqrt(sqeuc))  # Gm_j / |r_j - r_i|**3

        a[0] += scale * dx
        a[1] += scale * dy
        a[2] += scale * dz
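As a quick sanity check (my own sketch, not part of the original answer), the threaded Numba version should agree with a6 up to floating-point summation order:

n = 4000
r = np.random.randn(n, 3)
Gm = np.full(n, .01)

assert np.allclose(a6(r, Gm), a_numba_par(r, Gm))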

Answer 2 (score: 0):

I doubt numpy is actually computing the distances twice (since the matrix is always symmetric). It is probably doing the calculation once and assigning the same value in two places.

But I did come up with a few ideas:

  1. You could follow the numpy source and write a custom version of cdist. It may be that the routine parses lots of options on every call; it won't be much, but perhaps it gets you a small percentage.
  2. Preallocate. Every time you run a(), memory is probably reallocated for all of the intermediate matrix values. Can you make these buffers persistent? (A sketch of this idea follows below.)
  3. I haven't worked through the math, but I wouldn't be surprised if the redundant symmetric computations could somehow be elegantly reduced. (A second sketch below shows one way to compute each pair only once.)
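For idea 2, here is a minimal sketch (my own illustration, not tested against the original class) of how the intermediates could be allocated once and reused, using the out= argument that ufuncs and np.einsum accept:

import numpy as np

def make_a(r, Gm):
    n = r.shape[0]
    sep = np.empty((n, n, 3))  # separation vectors, allocated once
    d = np.empty((n, n))       # distance-related scratch buffer
    acc = np.empty((n, 3))     # result buffer

    def a():
        np.subtract(r[np.newaxis, :, :], r[:, np.newaxis, :], out=sep)
        np.einsum('ijk,ijk->ij', sep, sep, out=d)  # squared distances
        np.fill_diagonal(d, 1.)                    # avoid division by zero
        np.power(d, 1.5, out=d)                    # |r_j - r_i|**3 in place
        np.divide(Gm, d, out=d)                    # Gm_j / |r_j - r_i|**3
        return np.einsum('ijk,ij->ik', sep, d, out=acc)

    return a

For idea 3, here is a sketch of how the symmetry could be exploited in pure numpy: each unordered pair is computed once and Newton's third law fills in the mirror contribution. Note that np.add.at is itself quite slow, so this demonstrates the idea rather than guaranteeing a speedup:

import numpy as np

def a_symmetric(r, Gm):
    N = r.shape[0]
    i, j = np.triu_indices(N, k=1)  # every unordered pair exactly once
    sep = r[j] - r[i]               # r_j - r_i
    f = sep / np.sum(sep**2, axis=1)[:, np.newaxis]**1.5  # (r_j - r_i) / |r_j - r_i|**3
    a = np.zeros_like(r)
    np.add.at(a, i, f * Gm[j][:, np.newaxis])    # pull of j on i
    np.add.at(a, j, -f * Gm[i][:, np.newaxis])   # equal and opposite pull of i on j
    return a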