我正在julia写一个直接的例行程序,可移植性问题让我在C中重写它。写完之后,我对加速感到惊讶,我期待甚至一个数量级而不是两个!
我想知道在C中只改写是否正常如此加速,而Julia如此专注于速度和HPC。
这是代码,我简化了它们,使它们简洁,保留了C的加速比(所有质量都是1,力就是两个物体的距离)。
循环迭代每个索引(星形属性数组是固定大小,但我只使用前400个进行测试)并计算其余索引的贡献,然后使用Euler积分器计算新位置(新速度+ = F / m乘以dt,新位置+ =速度乘以dt)。
使用gcc
编译的C代码,没有特殊标志,time ./a.out
给出0.98s:
#include <stdio.h>
#include <stdlib.h>
// Array of stars is fixed size. It's initialized to a maximum size
// and only the needed portion it's used.
#define MAX_STAR_N (int)5e5
double *x,*y,*z,*vx,*vy,*vz;
void evolve_bruteforce(double dt){
// Compute forces and integrate the system with an Euler
int i,j;
for(i=0;i<400;i++){
double cacheforce[3] = {0,0,0};
double thisforce[3];
for(j=0;j<400;j++){
if(i!=j){
thisforce[0] = (x[j] - x[i]);
thisforce[1] = (y[j] - y[i]);
thisforce[2] = (z[j] - z[i]);
cacheforce[0] += thisforce[0];
cacheforce[1] += thisforce[1];
cacheforce[2] += thisforce[2];
}
}
vx[i] += cacheforce[0]*dt;
vy[i] += cacheforce[1]*dt;
vz[i] += cacheforce[2]*dt;
}
for(i=0;i<400;i++){
x[i] += vx[i]*dt;
y[i] += vy[i]*dt;
z[i] += vz[i]*dt;
}
}
int main (int argc, char *argv[]){
// Malloc all the arrays needed
x = malloc(sizeof(double)*MAX_STAR_N);
y = malloc(sizeof(double)*MAX_STAR_N);
z = malloc(sizeof(double)*MAX_STAR_N);
vx = malloc(sizeof(double)*MAX_STAR_N);
vy = malloc(sizeof(double)*MAX_STAR_N);
vz = malloc(sizeof(double)*MAX_STAR_N);
int i;
for(i=0;i<1000;i++)
{
evolve_bruteforce(0.001);
}
}
用julia -O --check-bounds=no
执行的Julia代码给出了102秒:
function evolve_bruteforce(dt,x,y,z,vx,vy,vz)
for i in 1:400
cacheforce = [0.0,0.0,0.0]
thisforce = Vector{Float64}(3)
for j in 1:400
if i != j
thisforce[1] = (x[j] - x[i])
thisforce[2] = (y[j] - y[i])
thisforce[3] = (z[j] - z[i])
cacheforce[1] += thisforce[1]
cacheforce[2] += thisforce[2]
cacheforce[3] += thisforce[3]
vx[i] += cacheforce[1]*dt
vy[i] += cacheforce[2]*dt
vz[i] += cacheforce[3]*dt
end
for i in 1:400
x[i] += vx[i]*dt
y[i] += vy[i]*dt
z[i] += vz[i]*dt
end
end
end
end
function main()
x = zeros(500000)
y = zeros(500000)
z = zeros(500000)
vx = zeros(500000)
vy = zeros(500000)
vz = zeros(500000)
@time for i in 1:1000
evolve_bruteforce(0.001,x,y,z,vx,vy,vz)
end
end
main()
我不知道如何让这个更容易回答,如果我能以任何方式修改帖子,请告诉我。
答案 0 :(得分:9)
正如评论中指出的那样,julia代码不等同于C代码。在julia代码中,第二个for i in 1:400
在内部而不是在第一个for循环之后。 if语句中的代码也不一样。
以下版本的evolve_bruteforce
更符合C代码:
function evolve_bruteforce(dt,x,y,z,vx,vy,vz)
for i in 1:400
cacheforce = [0.0,0.0,0.0]
thisforce = Vector{Float64}(3)
for j in 1:400
if i != j
thisforce[1] = (x[j] - x[i])
thisforce[2] = (y[j] - y[i])
thisforce[3] = (z[j] - z[i])
cacheforce[1] += thisforce[1]
cacheforce[2] += thisforce[2]
cacheforce[3] += thisforce[3]
end
end
# this bit was inside the if statement
vx[i] += cacheforce[1]*dt
vy[i] += cacheforce[2]*dt
vz[i] += cacheforce[3]*dt
end
# this loop was nested inside the first one
for i in 1:400
x[i] += vx[i]*dt
y[i] += vy[i]*dt
z[i] += vz[i]*dt
end
end
应该注意的是,这个答案中的基准非常天真且可能不公平,可以使用许多不同的语言和编译器特定优化来提升性能。
上面的Julia代码给出了大约2.2和1.7秒的执行时间:
# without any flags
2.188550 seconds (800.00 k allocations: 61.035 MB, 0.19% gc time)
2.199045 seconds (800.00 k allocations: 61.035 MB, 0.15% gc time)
2.194662 seconds (800.00 k allocations: 61.035 MB, 0.15% gc time)
# using the flags in the question: julia -O --check-bounds=on
1.688692 seconds (800.00 k allocations: 61.035 MB, 0.19% gc time)
1.705764 seconds (800.00 k allocations: 61.035 MB, 0.19% gc time)
1.688692 seconds (800.00 k allocations: 61.035 MB, 0.19% gc time)
在同一台笔记本电脑上,问题中发布的C代码的执行时间大约为1.6和0.6秒:
# gcc without any flags
1.568s
1.585s
1.592s
# using gcc -Ofast
0.620s
0.594s
0.568s
答案 1 :(得分:9)
使用硬编码的3维代码时,使用Tuple
类型代替Array
更合适(在模拟过程中不会附加额外的物理尺寸 - 即使在执行超弦时理论)。
像这样重写@ jarmokivekas的evolve_bruteforce
:
function evolve_bruteforce(dt,x,y,z,vx,vy,vz)
for i in 1:400
cacheforce = (0.0,0.0,0.0)
thisforce = (0.0,0.0,0.0)
for j in 1:400
if i != j
thisforce = ((x[j] - x[i]),(y[j] - y[i]),(z[j] - z[i]))
cacheforce = (cacheforce[1]+thisforce[1],
cacheforce[2]+thisforce[2],
cacheforce[3]+thisforce[3])
end
end
# this bit was inside the if statement
(vx[i],vy[i],vz[i]) = (vx[i]+cacheforce[1]*dt,
vy[i]+cacheforce[2]*dt,
vz[i]+cacheforce[3]*dt)
end
# this loop was nested inside the first one
for i in 1:400
(x[i],y[i],z[i]) = (x[i]+vx[i]*dt,y[i]+vy[i]*dt,z[i]+vz[i]*dt)
end
end
这使得另一个 2x加速(在此机器上从1.1秒到0.5秒)。
答案 2 :(得分:0)
编辑:这个答案随着Julia 0.7-alpha版本的出现而来。此外,除非在发布之后,我没有观察到另一个答案非常接近我的。
处理短向量的“朱利安”方法是使用SVector
或简单tuple
s。对cacheforce
和thisforce
使用两个元组并使用-O3
进行编译会产生比C更快的julia版本。
function evolve_bruteforce(dt,x,y,z,vx,vy,vz)
@inbounds for i in 1:400
cacheforce = (0.0,0.0,0.0)
thisforce = (0.0,0.0,0.0)
for j in 1:400
if i != j
thisforce = (x[j] - x[i],
y[j] - y[i],
z[j] - z[i])
cacheforce = (thisforce[1]+cacheforce[1],
thisforce[2]+cacheforce[2],
thisforce[3]+cacheforce[3])
end
end
# this bit was inside the if statement
vx[i] += cacheforce[1]*dt
vy[i] += cacheforce[2]*dt
vz[i] += cacheforce[3]*dt
end
# this loop was nested inside the first one
@inbounds for i in 1:400
x[i] += vx[i]*dt
y[i] += vy[i]*dt
z[i] += vz[i]*dt
end
end
function main()
x = Array{Float64}(undef,500000)
y = Array{Float64}(undef,500000)
z = Array{Float64}(undef,500000)
vx = Array{Float64}(undef,500000)
vy = Array{Float64}(undef,500000)
vz = Array{Float64}(undef,500000)
@time for i in 1:1000
evolve_bruteforce(0.001,x,y,z,vx,vy,vz)
end
end
main()
时机令人印象深刻:
# using julia -O3 main.jl
0.223773 seconds
# using gcc -Ofast
Process returned 0 (0x0) execution time : 0.249 s