Question

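I am running a program that repeatedly computes pairwise distances over a set of points in 2D and then computes vectors from them. This ended up being the bottleneck in my run time, so I tried rewriting my code from Matlab to Julia to take advantage of its higher speed. The problem is that the function I wrote in Julia actually runs five times slower than my Matlab implementation. Given Julia's reputation as a much faster language, I assume I am doing something wrong.

I wrote a simple example to illustrate what I am seeing.

Julia code: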

using Distances
function foo()
  historyarray = zeros(5000,150,2)
  a = randn(150,2)
  for i in 1:5000
    pd = pairwise(Euclidean(),a.')
    xgv = broadcast(-,a[:,1].',a[:,1])
    ygv = broadcast(-,a[:,2].',a[:,2])
    th = atan2(ygv,xgv)
    fv = 1./(pd+1)
    xfv = fv*cos(th)
    yfv = fv*sin(th)
    a[:,1]+= sum(xfv,2)
    a[:,2]+= sum(yfv,2)

    historyarray[i,:,:] = copy(a)
  end
end

Matlab code:

function foo
histarray = zeros(5000,150,2);
a = randn(150,2);
for i=1:5000

    pd = pdist2(a,a);
    xgv = -bsxfun(@minus,a(:,1),a(:,1)');
    ygv = -bsxfun(@minus,a(:,2),a(:,2)');
    th = atan2(ygv,xgv);
    fv = 1./(pd+1);
    xfv = fv.*cos(th);
    yfv = fv.*sin(th);
    a(:,1) = a(:,1)+sum(xfv,2);
    a(:,2) = a(:,2)+sum(yfv,2);

    histarray(i,:,:)=a;
end

end

When I test the speed of the Julia code (running it several times to account for compilation time), I get:

@time foo()
16.110077 seconds (2.65 M allocations: 8.566 GB, 6.30% gc time)

The Matlab function, on the other hand, performs as follows:

tic
foo
toc
Elapsed time is 3.339807 seconds.

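When I run the profile viewer on the Julia code, the components that take the most time are lines 9, 11, and 12. Could something strange be happening with the trig functions?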

Answer 1

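You are right that the calls to sin, cos, and atan2 are the bottleneck in your Julia code. Nevertheless, the large number of allocations means there is still room for optimization.

In the upcoming version of Julia, you can easily rewrite your code to avoid unnecessary allocations by using the improved dot-broadcasting syntax a .= f.(b,c). This is equivalent to broadcast!(f, a, b, c) and updates a in place. In addition, several dot-broadcast calls on the rhs are automatically fused into a single one. Finally, the @views macro turns all slicing operations such as a[:,1] into views. The new code looks like this: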

function foo2()
    a = rand(150,2)
    historyarray = zeros(5000,150,2)
    pd = zeros(size(a,1), size(a,1))
    xgv = similar(pd)
    ygv = similar(pd)
    th = similar(pd)
    fv = similar(pd)
    xfv = similar(pd)
    yfv = similar(pd)
    tmp = zeros(size(a,1))
    @views for i in 1:5000
        pairwise!(pd, Euclidean(),a.')
        xgv .= a[:,1].' .- a[:,1]
        ygv .= a[:,2].' .- a[:,2]
        th .= atan2.(ygv,xgv)
        fv .= 1./(pd.+1)
        xfv .= fv.*cos.(th)
        yfv .= fv.*sin.(th)
        a[:,1:1] .+= sum!(tmp, xfv)
        a[:,2:2] .+= sum!(tmp, yfv)
        historyarray[i,:,:] = a
    end
end

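(I used element-wise multiplication in xfv .= fv.*cos.(th), as in your Matlab code, rather than matrix multiplication.)

Benchmarking the new code shows a large reduction in allocated memory: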
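
As a small illustration of the in-place dot-broadcasting described above, here is a minimal sketch. It is written for current Julia (1.x) rather than the 0.5/0.6 syntax used in this answer, and the names `b`, `c`, `out1`, `out2`, and `fill_fused!` are made up for the example. It checks that `out .= f.(b, c)` gives the same result as `broadcast!(f, out, b, c)`, and shows several dotted calls fused into one pass over a preallocated array.

```julia
# Illustrative only: all names are hypothetical, syntax is Julia 1.x.
b = collect(1.0:5.0)          # column vector of length 5
c = collect(10.0:10.0:30.0)'  # 1x3 row (adjoint of a vector)

out1 = zeros(5, 3)
out2 = zeros(5, 3)

# Dot syntax: the result is written straight into out1.
out1 .= hypot.(b, c)

# The equivalent explicit call.
broadcast!(hypot, out2, b, c)

# Several dotted calls on the right-hand side fuse into a single loop,
# so no intermediate 5x3 arrays are created.
function fill_fused!(out, b, c)
    out .= 1 ./ (hypot.(b, c) .+ 1)
    return out
end

fused = fill_fused!(zeros(5, 3), b, c)
println(out1 == out2)
```

The same pattern is what lets foo2 reuse pd, xgv, ygv, and the other buffers on every iteration instead of allocating new matrices.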

julia> @benchmark foo2()
BenchmarkTools.Trial: 
  memory estimate:  67.88 MiB
  allocs estimate:  809507
  --------------
  minimum time:     7.655 s (0.06% GC)
  median time:      7.655 s (0.06% GC)
  mean time:        7.655 s (0.06% GC)
  maximum time:     7.655 s (0.06% GC)
  --------------
  samples:          1
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

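(Most of this can be achieved on 0.5, but it requires more typing.)

However, this still takes twice as long as your Matlab version. Profiling shows that most of the time is indeed spent in the trig functions.

Just for fun, I tried: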

const atan2 = +
const cos = x->2x
const sin = x->2x

and got:

julia> @benchmark foo2()
BenchmarkTools.Trial: 
  memory estimate:  67.88 MiB
  allocs estimate:  809507
  --------------
  minimum time:     1.020 s (0.69% GC)
  median time:      1.028 s (0.68% GC)
  mean time:        1.043 s (2.10% GC)
  maximum time:     1.100 s (7.75% GC)
  --------------
  samples:          5
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%

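One reason for the slow trig functions, I thought, might be that I use the prebuilt binaries rather than a self-compiled version of the libm library that Julia uses, so the libm code is not optimized for my processor. But I doubt that this would make Julia much faster than Matlab in this case. The Matlab code seems to be close to optimal already for this kind of algorithm.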

Answer 2

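I was able to get a speedup over Matlab by using multithreading on Julia 0.5.

On my machine (an i5 with 4 cores) I get the following timings:
Matlab R2012a: 8.5 seconds
Julia 0.5 single-threaded, foo3() (see below): 18.5 seconds
Julia 0.5 multithreaded, foo4() (see below): 4.5 seconds

That is, I could only get the single-threaded Julia function to run about half as fast as Matlab, but the multithreaded function runs about twice as fast as Matlab.

Sorry that this is a very long answer; I thought it better to be comprehensive. Below I post each of the inner functions used as well as the main functions, foo3() and foo4().

1. Single-threaded:

The purpose of the functions developed below is to avoid unnecessary memory allocation and to exploit the symmetry of the arrays. Judging from tim's answer, it looks like most of this can be handled with one-liners using dot notation in 0.6.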

function pdist2!(pd, a)
    m = size(a, 1)
    for col in 1:m
        for row in (col + 1):m
            s = 0.0
            for i in 1:2
                @inbounds s += abs2(a[col, i] - a[row, i])
            end
            @inbounds pd[row, col] = pd[col, row] = sqrt(s)
        end
    end
end

function dotminustranspose!(xgv, ygv, a)
    m = size(a, 1)
    for col in 1:m
        @inbounds for row in (col + 1):m
            xgv[row, col] = a[col, 1] - a[row, 1]
            ygv[row, col] = a[col, 2] - a[row, 2]
            xgv[col, row] = - xgv[row, col]
            ygv[col, row] = - ygv[row, col]
        end
    end
end

function atan2!(th, ygv, xgv)
    for i in eachindex(ygv)
        @inbounds th[i] = atan2(ygv[i], xgv[i])
    end
end

function invpdp1!(fv, pd)
    for i in eachindex(pd)
        @inbounds fv[i] = 1 / (pd[i] + 1)
    end
end

function fv_times_cos_th!(xfv, fv, th)
    for i in eachindex(th)
        @inbounds xfv[i] = fv[i] * cos(th[i])
    end
end

function fv_times_sin_th!(yfv, fv, th)
    for i in eachindex(th)
        @inbounds yfv[i] = fv[i] * sin(th[i])
    end
end

function adsum2!(a, xfv, yfv)
    n = size(a, 1)
    for j in 1:n
        for i in 1:n
            @inbounds a[i, 1] += xfv[i, j]
            @inbounds a[i, 2] += yfv[i, j]
        end
    end
end

function foo3()
    a = reshape(sin(1:300), 150, 2)
    histarray = zeros(5000, 150, 2)
    pd = zeros(size(a, 1), size(a, 1))
    xgv = zeros(pd)
    ygv = zeros(pd)
    th = zeros(pd)
    fv = zeros(pd)
    xfv = zeros(pd)
    yfv = zeros(pd)
    for i in 1:5000
        pdist2!(pd, a)
        dotminustranspose!(xgv, ygv, a)
        atan2!(th, ygv, xgv)
        invpdp1!(fv, pd)
        fv_times_cos_th!(xfv, fv, th)
        fv_times_sin_th!(yfv, fv, th)
        adsum2!(a, xfv, yfv)

        histarray[i, :, :] = view(a, :)
    end
    return histarray
end

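Timing:

@time histarray = foo3()
17.966093 seconds (24.51 k allocations: 13.404 MB)

2. Multithreaded:

The element-wise trig functions can be multithreaded using the @threads macro. This gave me roughly a 4x speedup. Multithreading is still experimental, but I tested the outputs and they are identical.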

using Base.Threads

function atan2_mt!(th, ygv, xgv)
    @threads for i in eachindex(ygv)
        @inbounds th[i] = atan2(ygv[i], xgv[i])
    end
end

function fv_times_cos_th_mt!(xfv, fv, th)
    @threads for i in eachindex(th)
        @inbounds xfv[i] = fv[i] * cos(th[i])
    end
end

function fv_times_sin_th_mt!(yfv, fv, th)
    @threads for i in eachindex(th)
        @inbounds yfv[i] = fv[i] * sin(th[i])
    end
end

function foo4()
    a = reshape(sin(1:300), 150, 2)
    histarray = zeros(5000, 150, 2)
    pd = zeros(size(a, 1), size(a, 1))
    xgv = zeros(pd)
    ygv = zeros(pd)
    th = zeros(pd)
    fv = zeros(pd)
    xfv = zeros(pd)
    yfv = zeros(pd)
    for i in 1:5000
        pdist2!(pd, a)
        dotminustranspose!(xgv, ygv, a)
        atan2_mt!(th, ygv, xgv)
        invpdp1!(fv, pd)
        fv_times_cos_th_mt!(xfv, fv, th)
        fv_times_sin_th_mt!(yfv, fv, th)
        adsum2!(a, xfv, yfv)

        histarray[i, :, :] = view(a, :)
    end
    return histarray
end

Timing:

@time histarray = foo4()
4.569416 seconds (54.51 k allocations: 14.320 MB, 0.20% gc time)
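
The "outputs are identical" check mentioned above can be sketched as follows. This is an illustration in current Julia (1.x) syntax, where the two-argument `atan(y, x)` replaces `atan2(y, x)`; the function names `angle_serial!` and `angle_threaded!` are hypothetical and not part of the answer's code. Julia has to be started with more than one thread (for example `julia -t 4`, or by setting `JULIA_NUM_THREADS`) for the threaded loop to actually run in parallel.

```julia
using Base.Threads

# Serial reference: element-wise two-argument arctangent.
function angle_serial!(th, y, x)
    for i in eachindex(th, y, x)
        @inbounds th[i] = atan(y[i], x[i])
    end
    return th
end

# Same loop body, with the index range split across the available threads.
# Each element is written by exactly one thread, so there is no data race.
function angle_threaded!(th, y, x)
    @threads for i in eachindex(th, y, x)
        @inbounds th[i] = atan(y[i], x[i])
    end
    return th
end

y = randn(150, 150)
x = randn(150, 150)
th1 = angle_serial!(similar(y), y, x)
th2 = angle_threaded!(similar(y), y, x)
println("threads: ", nthreads(), ", identical: ", th1 == th2)
```

Because every iteration is independent, the threaded version should produce bit-for-bit the same array as the serial one regardless of the thread count.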

Answer 3

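The idea behind the get-rid-of-trig refactoring is that sin(atan2(y,x))==y/sqrt(x^2+y^2), and likewise cos(atan2(y,x))==x/sqrt(x^2+y^2). Conveniently, the function hypot computes the square-root denominator. inv is used to get rid of slow divisions. Code: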

# a constant input matrix to allow foo2/foo3 comparison
a = randn(150,2)

# calculation using trig functions
function foo2(b,n)
  a = copy(b)
  historyarray = zeros(n,size(a,1),2)
  pd = zeros(size(a,1), size(a,1))
  xgv = similar(pd)
  ygv = similar(pd)
  th = similar(pd)
  fv = similar(pd)
  xfv = similar(pd)
  yfv = similar(pd)
  tmp = zeros(size(a,1))
  @views for i in 1:n
      pairwise!(pd, Euclidean(),a.')
      xgv .= a[:,1].' .- a[:,1]
      ygv .= a[:,2].' .- a[:,2]
      th .= atan2.(ygv,xgv)
      fv .= 1./(pd.+1)
      xfv .= fv.*cos.(th)
      yfv .= fv.*sin.(th)
      a[:,1:1] .+= sum!(tmp, xfv)
      a[:,2:2] .+= sum!(tmp, yfv)
      historyarray[i,:,:] = a
  end
end

# helper function to handle annoying Infs from self interaction calc
nantoone(x) = ifelse(isnan(x),1.0,x)
nantozero(x) = ifelse(isnan(x),0.0,x)

# calculations using Pythagoras
function foo3(b,n)
  a = copy(b)
  historyarray = zeros(n,size(a,1),2)
  pd = zeros(size(a,1), size(a,1))
  xgv = similar(pd)
  ygv = similar(pd)
  th = similar(pd)
  fv = similar(pd)
  xfv = similar(pd)
  yfv = similar(pd)
  tmp = zeros(size(a,1))
  @views for i in 1:n
      pairwise!(pd, Euclidean(),a.')
      xgv .= a[:,1].' .- a[:,1]
      ygv .= a[:,2].' .- a[:,2]
      th .= inv.(hypot.(ygv,xgv))
      fv .= inv.(pd.+1)
      xfv .= nantoone.(fv.*xgv.*th)
      yfv .= nantozero.(fv.*ygv.*th)
      a[:,1:1] .+= sum!(tmp, xfv)
      a[:,2:2] .+= sum!(tmp, yfv)
      historyarray[i,:,:] = a
  end
end

And a benchmark:

julia> @time foo2(a,5000)
  9.698825 seconds (809.51 k allocations: 67.880 MiB, 0.33% gc time)

julia> @time foo3(a,5000)
  2.207108 seconds (809.51 k allocations: 67.880 MiB, 1.15% gc time)

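More than a 4x improvement.

Another thing worth noting is how convenient NaN-to-something functions would be as an addition to Base (similar to coalesce and nvl in the SQL world).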
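
For readers who want to convince themselves of the identity behind this refactoring, here is a minimal numerical sketch in current Julia (1.x) syntax, where `atan(y, x)` is the two-argument arctangent. The helper names `trig_sin`, `trig_cos`, `pyth_sin`, and `pyth_cos` are made up for the illustration. It also shows why the NaN helpers are needed: for the self-interaction terms both differences are zero, and the Pythagorean form evaluates 0 * Inf.

```julia
# Hypothetical helper names; Julia 1.x spells atan2(y, x) as atan(y, x).
trig_sin(y, x) = sin(atan(y, x))
trig_cos(y, x) = cos(atan(y, x))
pyth_sin(y, x) = y * inv(hypot(y, x))
pyth_cos(y, x) = x * inv(hypot(y, x))

# A 3-4-5 triangle in the second quadrant.
y, x = 3.0, -4.0
println(trig_sin(y, x), " vs ", pyth_sin(y, x))
println(trig_cos(y, x), " vs ", pyth_cos(y, x))

# Self-interaction case: hypot(0, 0) is 0, inv(0.0) is Inf,
# and 0.0 * Inf is NaN, whereas the trig form gives cos(0) and sin(0).
println(isnan(pyth_sin(0.0, 0.0)), " ", isnan(pyth_cos(0.0, 0.0)))
println(trig_cos(0.0, 0.0), " ", trig_sin(0.0, 0.0))
```

The two forms agree up to floating-point rounding for nonzero inputs, which is why the diagonal entries have to be patched to 1 (cosine) and 0 (sine) to reproduce the trig version.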

Answer 4

You can use dot notation to broadcast certain operations. Take a look at the foo2() function.

using Distances
function foo1()
  historyarray = zeros(5000,150,2)
  a = randn(150,2)
  for i in 1:5000
    pd = pairwise(Euclidean(),a.')
    xgv = broadcast(-,a[:,1].',a[:,1])
    ygv = broadcast(-,a[:,2].',a[:,2])
    th = atan2(ygv,xgv)
    fv = 1./(pd+1)
    xfv = fv*cos(th)
    yfv = fv*sin(th)
    a[:,1]+= sum(xfv,2)
    a[:,2]+= sum(yfv,2)

    historyarray[i,:,:] = copy(a)
  end
end


function foo2()
  historyarray = zeros(5000,150,2)
  a = randn(150,2)
  for i in 1:5000
    pd = pairwise(Euclidean(),a.')
    xgv = broadcast(-,a[:,1].',a[:,1])
    ygv = broadcast(-,a[:,2].',a[:,2])
    th = atan2.(ygv,xgv)
    fv = 1./(pd+1)
    xfv = fv.*cos.(th)
    yfv = fv.*sin.(th)
    a[:,1]+= sum(xfv,2)
    a[:,2]+= sum(yfv,2)

    historyarray[i,:,:] = copy(a)
  end
end

@time foo1()
@time foo2()

Console output:

29.723805 seconds (2.65 M allocations: 8.566 GB, 1.15% gc time)
16.296859 seconds (2.81 M allocations: 8.571 GB, 2.54% gc time)
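
Note that the two functions do not compute the same thing: on matrices, `*` is matrix multiplication while `.*` multiplies element by element, so part of the timing gap may simply reflect the different work being done. A tiny sketch with made-up 2x2 matrices (the names `A`, `B`, `matprod`, and `elemprod` are illustrative only) shows the difference:

```julia
# Made-up inputs, chosen only to make the two products easy to tell apart.
A = [1.0 2.0; 3.0 4.0]
B = [0.0 1.0; 1.0 0.0]

matprod = A * B     # matrix product: rows of A times columns of B
elemprod = A .* B   # element-wise product: A[i, j] * B[i, j]

println(matprod)
println(elemprod)
```

The Matlab code in the question uses `.*`, so foo2 is the version that matches it.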

How can I get faster pairwise distance and vector calculations in Julia 0.5 compared to Matlab2016b?
