I am running a program that repeatedly needs pairwise distance calculations over a set of points in 2D, followed by vector computations. This ended up being the bottleneck in my runtime, so I tried rewriting my code from Matlab to Julia to take advantage of its greater speed. The problem I ran into, however, is that the function I wrote in Julia actually runs about five times slower than my Matlab implementation. Given Julia's reputation as a significantly faster language, I assume I am doing something wrong.
I wrote a simple example that illustrates what I am seeing.
Julia code:
using Distances
function foo()
    historyarray = zeros(5000,150,2)
    a = randn(150,2)
    for i in 1:5000
        pd = pairwise(Euclidean(),a.')
        xgv = broadcast(-,a[:,1].',a[:,1])
        ygv = broadcast(-,a[:,2].',a[:,2])
        th = atan2(ygv,xgv)
        fv = 1./(pd+1)
        xfv = fv*cos(th)
        yfv = fv*sin(th)
        a[:,1]+= sum(xfv,2)
        a[:,2]+= sum(yfv,2)
        historyarray[i,:,:] = copy(a)
    end
end
Matlab code:
function foo
    histarray = zeros(5000,150,2);
    a = randn(150,2);
    for i=1:5000
        pd = pdist2(a,a);
        xgv = -bsxfun(@minus,a(:,1),a(:,1)');
        ygv = -bsxfun(@minus,a(:,2),a(:,2)');
        th = atan2(ygv,xgv);
        fv = 1./(pd+1);
        xfv = fv.*cos(th);
        yfv = fv.*sin(th);
        a(:,1) = a(:,1)+sum(xfv,2);
        a(:,2) = a(:,2)+sum(yfv,2);
        histarray(i,:,:)=a;
    end
end
When I test the speed of the Julia code (running it several times to account for compilation), I get:
@time foo()
16.110077 seconds (2.65 M allocations: 8.566 GB, 6.30% gc time)
The performance of the Matlab function, on the other hand, is:
tic
foo
toc
Elapsed time is 3.339807 seconds.
When I run the profile viewer on the Julia code, the components that take the most time are lines 9, 11 and 12 (the atan2, cos, and sin lines). Could something strange be going on with the trigonometric functions?
Answer 0 (score: 6)
You are correct that the calls to sin, cos and atan2 are the bottleneck in your Julia code. Nevertheless, the large number of allocations means there is still potential for optimization.
In the upcoming Julia version 0.6 you can easily rewrite your code to avoid unnecessary allocations using the improved dot-broadcasting syntax a .= f.(b,c). This is equivalent to broadcast!(f, a, b, c) and updates a in place. Furthermore, several dot-broadcast calls on the RHS are automatically fused into a single loop. Finally, the @views macro turns every slicing operation like a[:,1] into a view. The new code looks like this:
function foo2()
    a = rand(150,2)
    historyarray = zeros(5000,150,2)
    pd = zeros(size(a,1), size(a,1))
    xgv = similar(pd)
    ygv = similar(pd)
    th = similar(pd)
    fv = similar(pd)
    xfv = similar(pd)
    yfv = similar(pd)
    tmp = zeros(size(a,1))
    @views for i in 1:5000
        pairwise!(pd, Euclidean(), a.')
        xgv .= a[:,1].' .- a[:,1]
        ygv .= a[:,2].' .- a[:,2]
        th .= atan2.(ygv,xgv)
        fv .= 1./(pd.+1)
        xfv .= fv.*cos.(th)
        yfv .= fv.*sin.(th)
        a[:,1:1] .+= sum!(tmp, xfv)
        a[:,2:2] .+= sum!(tmp, yfv)
        historyarray[i,:,:] = a
    end
end
(Note that in xfv .= fv.*cos.(th) I used element-wise multiplication, as in your Matlab code, rather than the matrix multiplication in your Julia version.)
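To show the broadcasting machinery in isolation, here is a minimal sketch assuming 0.6 syntax (the variable names a, b, c, m, col are throwaway, not part of the code above):

a = zeros(3); b = rand(3); c = rand(3)
# fused in-place update: equivalent to broadcast!((x,y) -> x + 2y, a, b, c);
# the whole RHS runs as one loop with no temporary arrays
a .= b .+ 2 .* c

m = rand(4, 2)
@views col = m[:, 1]   # col is a view (SubArray), not a copy
col .= 0.0             # mutates m through the view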
Benchmarking the new code shows a large reduction in allocated memory:
julia> @benchmark foo2()
BenchmarkTools.Trial:
memory estimate: 67.88 MiB
allocs estimate: 809507
--------------
minimum time: 7.655 s (0.06% GC)
median time: 7.655 s (0.06% GC)
mean time: 7.655 s (0.06% GC)
maximum time: 7.655 s (0.06% GC)
--------------
samples: 1
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
(Most of this could be achieved on 0.5 as well, but it requires more typing.)
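To make "more typing" concrete, here is a rough sketch, assuming 0.5 semantics, of how one fused line from foo2 would have to be spelled out:

# rough 0.5 equivalent of the fused line  xgv .= a[:,1].' .- a[:,1]
# (the .' on the view still allocates a small temporary row matrix)
broadcast!(-, xgv, view(a, :, 1).', view(a, :, 1))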
However, this is still about a factor of two slower than your Matlab version. Profiling reveals that most of the time is spent in the trigonometric functions.
Just for fun, I tried:
const atan2 = +
const cos = x->2x
const sin = x->2x
and got:
julia> @benchmark foo2()
BenchmarkTools.Trial:
memory estimate: 67.88 MiB
allocs estimate: 809507
--------------
minimum time: 1.020 s (0.69% GC)
median time: 1.028 s (0.68% GC)
mean time: 1.043 s (2.10% GC)
maximum time: 1.100 s (7.75% GC)
--------------
samples: 5
evals/sample: 1
time tolerance: 5.00%
memory tolerance: 1.00%
I guess one reason for the slow trigonometric functions could be that I use a prebuilt binary rather than a self-compiled version of the libm library that Julia uses, so the libm code is not optimized for my processor. But I doubt that this would make Julia much faster than Matlab in this case. For this kind of algorithm, the Matlab code already seems to be close to optimal.
Answer 1 (score: 4)
I can get a speedup over Matlab by using multithreading on Julia 0.5.
On my machine (an i5 with 4 cores) I get the following timings:
Matlab R2012a - 8.5 seconds
Julia 0.5 single-threaded - foo3() (see below) - 18.5 seconds
Julia 0.5 multithreaded - foo4() (see below) - 4.5 seconds
That is, the single-threaded Julia function runs at roughly half the speed of Matlab, but the multithreaded one runs roughly twice as fast as Matlab.
Apologies for a rather long answer - I thought it better to be comprehensive. Below I post each of the inner functions that are used, followed by the main functions, foo3() and foo4().
1. Single-threaded:
The helper functions developed below aim to avoid unnecessary memory allocation and to exploit the symmetry of the arrays. Following Tim's answer, it looks like most of this can be handled with one-line dot-notation calls in 0.6.
# in-place pairwise Euclidean distances; computes each pair once
# and mirrors it across the diagonal, since the matrix is symmetric
function pdist2!(pd, a)
    m = size(a, 1)
    for col in 1:m
        for row in (col + 1):m
            s = 0.0
            for i in 1:2
                @inbounds s += abs2(a[col, i] - a[row, i])
            end
            @inbounds pd[row, col] = pd[col, row] = sqrt(s)
        end
    end
end

# in-place coordinate differences; again fills both triangles
# from one computation using antisymmetry
function dotminustranspose!(xgv, ygv, a)
    m = size(a, 1)
    for col in 1:m
        @inbounds for row in (col + 1):m
            xgv[row, col] = a[col, 1] - a[row, 1]
            ygv[row, col] = a[col, 2] - a[row, 2]
            xgv[col, row] = - xgv[row, col]
            ygv[col, row] = - ygv[row, col]
        end
    end
end

function atan2!(th, ygv, xgv)
    for i in eachindex(ygv)
        @inbounds th[i] = atan2(ygv[i], xgv[i])
    end
end

function invpdp1!(fv, pd)
    for i in eachindex(pd)
        @inbounds fv[i] = 1 / (pd[i] + 1)
    end
end

function fv_times_cos_th!(xfv, fv, th)
    for i in eachindex(th)
        @inbounds xfv[i] = fv[i] * cos(th[i])
    end
end

function fv_times_sin_th!(yfv, fv, th)
    for i in eachindex(th)
        @inbounds yfv[i] = fv[i] * sin(th[i])
    end
end

function adsum2!(a, xfv, yfv)
    n = size(a, 1)
    for j in 1:n
        for i in 1:n
            @inbounds a[i, 1] += xfv[i, j]
            @inbounds a[i, 2] += yfv[i, j]
        end
    end
end
function foo3()
    a = reshape(sin(1:300), 150, 2)
    histarray = zeros(5000, 150, 2)
    pd = zeros(size(a, 1), size(a, 1))
    xgv = zeros(pd)
    ygv = zeros(pd)
    th = zeros(pd)
    fv = zeros(pd)
    xfv = zeros(pd)
    yfv = zeros(pd)
    for i in 1:5000
        pdist2!(pd, a)
        dotminustranspose!(xgv, ygv, a)
        atan2!(th, ygv, xgv)
        invpdp1!(fv, pd)
        fv_times_cos_th!(xfv, fv, th)
        fv_times_sin_th!(yfv, fv, th)
        adsum2!(a, xfv, yfv)
        histarray[i, :, :] = view(a, :)
    end
    return histarray
end
Timing:
@time histarray = foo3()
17.966093 seconds (24.51 k allocations: 13.404 MB)
2. Multithreaded:
The elementwise trigonometric functions can be multithreaded using the @threads macro. This gives me about another 4x speedup. Threading is still experimental, but I tested the outputs and they are identical.
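(One caveat in case you want to reproduce this: @threads silently falls back to serial execution unless Julia was started with more than one thread, which, as far as I know, is controlled by the JULIA_NUM_THREADS environment variable. A quick check:)

# launch Julia with e.g.: JULIA_NUM_THREADS=4 julia
Threads.nthreads()  # should report 4; if it reports 1, the @threads loops below run serially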
using Base.Threads

function atan2_mt!(th, ygv, xgv)
    @threads for i in eachindex(ygv)
        @inbounds th[i] = atan2(ygv[i], xgv[i])
    end
end

function fv_times_cos_th_mt!(xfv, fv, th)
    @threads for i in eachindex(th)
        @inbounds xfv[i] = fv[i] * cos(th[i])
    end
end

function fv_times_sin_th_mt!(yfv, fv, th)
    @threads for i in eachindex(th)
        @inbounds yfv[i] = fv[i] * sin(th[i])
    end
end

function foo4()
    a = reshape(sin(1:300), 150, 2)
    histarray = zeros(5000, 150, 2)
    pd = zeros(size(a, 1), size(a, 1))
    xgv = zeros(pd)
    ygv = zeros(pd)
    th = zeros(pd)
    fv = zeros(pd)
    xfv = zeros(pd)
    yfv = zeros(pd)
    for i in 1:5000
        pdist2!(pd, a)
        dotminustranspose!(xgv, ygv, a)
        atan2_mt!(th, ygv, xgv)
        invpdp1!(fv, pd)
        fv_times_cos_th_mt!(xfv, fv, th)
        fv_times_sin_th_mt!(yfv, fv, th)
        adsum2!(a, xfv, yfv)
        histarray[i, :, :] = view(a, :)
    end
    return histarray
end
Timing:
@time histarray = foo4()
4.569416 seconds (54.51 k allocations: 14.320 MB, 0.20% gc time)
Answer 2 (score: 4)
The idea behind the get-rid-of-trig refactoring is that sin(atan2(y,x)) == y/sqrt(x^2+y^2), and likewise cos(atan2(y,x)) == x/sqrt(x^2+y^2). Conveniently, the function hypot calculates the square-root denominator, and inv is used to get rid of the slow divisions.
using Distances  # for pairwise!

# a constant input matrix to allow foo2/foo3 comparison
a = randn(150,2)

# calculation using trig functions
function foo2(b,n)
    a = copy(b)
    historyarray = zeros(n,size(a,1),2)
    pd = zeros(size(a,1), size(a,1))
    xgv = similar(pd)
    ygv = similar(pd)
    th = similar(pd)
    fv = similar(pd)
    xfv = similar(pd)
    yfv = similar(pd)
    tmp = zeros(size(a,1))
    @views for i in 1:n
        pairwise!(pd, Euclidean(), a.')
        xgv .= a[:,1].' .- a[:,1]
        ygv .= a[:,2].' .- a[:,2]
        th .= atan2.(ygv,xgv)
        fv .= 1./(pd.+1)
        xfv .= fv.*cos.(th)
        yfv .= fv.*sin.(th)
        a[:,1:1] .+= sum!(tmp, xfv)
        a[:,2:2] .+= sum!(tmp, yfv)
        historyarray[i,:,:] = a
    end
end
# helper functions to handle the NaNs from the self-interaction terms:
# on the diagonal xgv == ygv == 0, so fv.*xgv.*th becomes 0*Inf == NaN;
# the trig version produced cos(atan2(0,0)) == 1 and sin(atan2(0,0)) == 0
# there (times fv == 1 on the diagonal), hence the two replacement values
nantoone(x) = ifelse(isnan(x),1.0,x)
nantozero(x) = ifelse(isnan(x),0.0,x)
# calculations using Pythagoras
function foo3(b,n)
    a = copy(b)
    historyarray = zeros(n,size(a,1),2)
    pd = zeros(size(a,1), size(a,1))
    xgv = similar(pd)
    ygv = similar(pd)
    th = similar(pd)
    fv = similar(pd)
    xfv = similar(pd)
    yfv = similar(pd)
    tmp = zeros(size(a,1))
    @views for i in 1:n
        pairwise!(pd, Euclidean(), a.')
        xgv .= a[:,1].' .- a[:,1]
        ygv .= a[:,2].' .- a[:,2]
        th .= inv.(hypot.(ygv,xgv))
        fv .= inv.(pd.+1)
        xfv .= nantoone.(fv.*xgv.*th)
        yfv .= nantozero.(fv.*ygv.*th)
        a[:,1:1] .+= sum!(tmp, xfv)
        a[:,2:2] .+= sum!(tmp, yfv)
        historyarray[i,:,:] = a
    end
end
And a benchmark:
julia> @time foo2(a,5000)
9.698825 seconds (809.51 k allocations: 67.880 MiB, 0.33% gc time)
julia> @time foo3(a,5000)
2.207108 seconds (809.51 k allocations: 67.880 MiB, 1.15% gc time)
A >4x improvement.
Another thing worth noting is the convenience of the NaN-to-something functions, which could perhaps be added to Base (similar to coalesce and nvl in the SQL world).
Answer 3 (score: 1)
You can use the dot notation to broadcast some of the operations. Compare the foo2() function below.
using Distances
function foo1()
    historyarray = zeros(5000,150,2)
    a = randn(150,2)
    for i in 1:5000
        pd = pairwise(Euclidean(),a.')
        xgv = broadcast(-,a[:,1].',a[:,1])
        ygv = broadcast(-,a[:,2].',a[:,2])
        th = atan2(ygv,xgv)
        fv = 1./(pd+1)
        xfv = fv*cos(th)
        yfv = fv*sin(th)
        a[:,1]+= sum(xfv,2)
        a[:,2]+= sum(yfv,2)
        historyarray[i,:,:] = copy(a)
    end
end

function foo2()
    historyarray = zeros(5000,150,2)
    a = randn(150,2)
    for i in 1:5000
        pd = pairwise(Euclidean(),a.')
        xgv = broadcast(-,a[:,1].',a[:,1])
        ygv = broadcast(-,a[:,2].',a[:,2])
        th = atan2.(ygv,xgv)
        fv = 1./(pd+1)
        xfv = fv.*cos.(th)
        yfv = fv.*sin.(th)
        a[:,1]+= sum(xfv,2)
        a[:,2]+= sum(yfv,2)
        historyarray[i,:,:] = copy(a)
    end
end
@time foo1()
@time foo2()
Console output:
29.723805 seconds (2.65 M allocations: 8.566 GB, 1.15% gc time)
16.296859 seconds (2.81 M allocations: 8.571 GB, 2.54% gc time)