Question

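I was writing a straightforward routine in Julia, and portability concerns led me to rewrite it in C. After the rewrite I was surprised by the speedup: I expected perhaps one order of magnitude, not two!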

I would like to know whether a plain rewrite in C normally gives such a speedup, given that Julia is so focused on speed and HPC.

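Here is the code. I simplified both versions to keep them concise while preserving the speedup of the C version (all masses are 1, and the force is simply the displacement between two bodies).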

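The loop iterates over every index (the star-property arrays have a fixed size, but I only use the first 400 for testing) and accumulates the contributions of all the other indices. It then uses an Euler integrator to compute the new positions (new velocity += F/m times dt, new position += velocity times dt).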

The C code, compiled with gcc and no special flags; time ./a.out gives 0.98 s:

#include <stdio.h>
#include <stdlib.h>

// The star arrays have a fixed size: they are allocated at the maximum
// size and only the needed portion is used.
#define MAX_STAR_N (int)5e5

double *x,*y,*z,*vx,*vy,*vz;    

void evolve_bruteforce(double dt){
    // Compute forces and integrate the system with an Euler
    int i,j;
    for(i=0;i<400;i++){
        double cacheforce[3] = {0,0,0};
        double thisforce[3];
        for(j=0;j<400;j++){
            if(i!=j){
                thisforce[0] = (x[j] - x[i]);
                thisforce[1] = (y[j] - y[i]);
                thisforce[2] = (z[j] - z[i]);
                cacheforce[0] += thisforce[0];
                cacheforce[1] += thisforce[1];
                cacheforce[2] += thisforce[2];
            }
         }
        vx[i] += cacheforce[0]*dt;
        vy[i] += cacheforce[1]*dt;
        vz[i] += cacheforce[2]*dt;
    }
    for(i=0;i<400;i++){
       x[i] += vx[i]*dt;
       y[i] += vy[i]*dt;
       z[i] += vz[i]*dt;
    }
}



int main (int argc, char *argv[]){
    // Malloc all the arrays needed
    x   = malloc(sizeof(double)*MAX_STAR_N);
    y   = malloc(sizeof(double)*MAX_STAR_N);
    z   = malloc(sizeof(double)*MAX_STAR_N);
    vx  = malloc(sizeof(double)*MAX_STAR_N);
    vy  = malloc(sizeof(double)*MAX_STAR_N);
    vz  = malloc(sizeof(double)*MAX_STAR_N);

    int i;
    for(i=0;i<1000;i++)
    {
        evolve_bruteforce(0.001);
    }
}

The Julia code, run with julia -O --check-bounds=no, takes 102 seconds:

function evolve_bruteforce(dt,x,y,z,vx,vy,vz)
    for i in 1:400
        cacheforce = [0.0,0.0,0.0]
        thisforce = Vector{Float64}(3)
        for j in 1:400
            if i != j
                thisforce[1] = (x[j] - x[i])
                thisforce[2] = (y[j] - y[i])
                thisforce[3] = (z[j] - z[i])
                cacheforce[1] += thisforce[1]
                cacheforce[2] += thisforce[2]
                cacheforce[3] += thisforce[3]
                vx[i] += cacheforce[1]*dt
                vy[i] += cacheforce[2]*dt
                vz[i] += cacheforce[3]*dt
            end
            for i in 1:400
                x[i] += vx[i]*dt
                y[i] += vy[i]*dt
                z[i] += vz[i]*dt
            end
        end
    end
end



function main()
    x = zeros(500000)
    y = zeros(500000)
    z = zeros(500000)
    vx = zeros(500000)
    vy = zeros(500000)
    vz = zeros(500000)
    @time for i in 1:1000
        evolve_bruteforce(0.001,x,y,z,vx,vy,vz)
    end
end

main()

I am not sure how to make this easier to answer; if I can improve the post in any way, please let me know.

Answer 1

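As pointed out in the comments, the Julia code is not equivalent to the C code. In the Julia code, the second for i in 1:400 loop is nested inside the first loop instead of coming after it. The code inside the if statement also differs: the velocity updates sit inside it.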

The following version of evolve_bruteforce matches the C code more closely:

function evolve_bruteforce(dt,x,y,z,vx,vy,vz)
    for i in 1:400
        cacheforce = [0.0,0.0,0.0]
        thisforce = Vector{Float64}(3)
        for j in 1:400
            if i != j
                thisforce[1] = (x[j] - x[i])
                thisforce[2] = (y[j] - y[i])
                thisforce[3] = (z[j] - z[i])
                cacheforce[1] += thisforce[1]
                cacheforce[2] += thisforce[2]
                cacheforce[3] += thisforce[3]
            end
        end
        # this bit was inside the if statement
        vx[i] += cacheforce[1]*dt
        vy[i] += cacheforce[2]*dt
        vz[i] += cacheforce[3]*dt
    end
    # this loop was nested inside the first one
    for i in 1:400
        x[i] += vx[i]*dt
        y[i] += vy[i]*dt
        z[i] += vz[i]*dt
    end
end

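Note that the benchmarks in this answer are very naive and possibly unfair; many language- and compiler-specific optimizations could be used to improve performance.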

The Julia code above gives execution times of about 2.2 and 1.7 seconds:

# without any flags
2.188550 seconds (800.00 k allocations: 61.035 MB, 0.19% gc time)
2.199045 seconds (800.00 k allocations: 61.035 MB, 0.15% gc time)
2.194662 seconds (800.00 k allocations: 61.035 MB, 0.15% gc time)
# using the flags in the question: julia -O --check-bounds=no
1.688692 seconds (800.00 k allocations: 61.035 MB, 0.19% gc time)
1.705764 seconds (800.00 k allocations: 61.035 MB, 0.19% gc time)
1.688692 seconds (800.00 k allocations: 61.035 MB, 0.19% gc time)

On the same laptop, the C code posted in the question takes about 1.6 and 0.6 seconds:

# gcc without any flags
1.568s
1.585s
1.592s
# using gcc -Ofast
0.620s
0.594s
0.568s

Answer 2

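When the code is hard-coded for 3 dimensions, a Tuple is a more appropriate type than an Array (no extra physical dimensions will be appended during the simulation, not even when doing superstring theory).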

Rewriting @jarmokivekas's evolve_bruteforce like this:

function evolve_bruteforce(dt,x,y,z,vx,vy,vz)
    for i in 1:400
        cacheforce = (0.0,0.0,0.0)
        thisforce = (0.0,0.0,0.0)
        for j in 1:400
            if i != j
                thisforce = ((x[j] - x[i]),(y[j] - y[i]),(z[j] - z[i]))
                cacheforce = (cacheforce[1]+thisforce[1],
                              cacheforce[2]+thisforce[2],
                              cacheforce[3]+thisforce[3])
            end
        end
        # this bit was inside the if statement
        (vx[i],vy[i],vz[i]) = (vx[i]+cacheforce[1]*dt,
                               vy[i]+cacheforce[2]*dt,
                               vz[i]+cacheforce[3]*dt)
    end
    # this loop was nested inside the first one
    for i in 1:400
         (x[i],y[i],z[i]) = (x[i]+vx[i]*dt,y[i]+vy[i]*dt,z[i]+vz[i]*dt)
    end
end

This gives another 2x speedup (from 1.1 s to 0.5 s on this machine).

Answer 3

Edit: this answer was written as Julia 0.7-alpha came out. Also, I did not notice until after posting that another answer is very close to mine.

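The "Julian" way of handling short vectors is to use SVector or plain tuples. Using two tuples for cacheforce and thisforce and compiling with -O3 yields a Julia version that is faster than the C one.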

function evolve_bruteforce(dt,x,y,z,vx,vy,vz)

    @inbounds for i in 1:400
        cacheforce =  (0.0,0.0,0.0)
        thisforce  =  (0.0,0.0,0.0) 
        for j in 1:400
            if i != j
                thisforce  = (x[j] - x[i],
                              y[j] - y[i],
                              z[j] - z[i])
                cacheforce = (thisforce[1]+cacheforce[1],
                              thisforce[2]+cacheforce[2],
                              thisforce[3]+cacheforce[3])
            end
        end
        # this bit was inside the if statement
        vx[i] += cacheforce[1]*dt
        vy[i] += cacheforce[2]*dt
        vz[i] += cacheforce[3]*dt
    end
    # this loop was nested inside the first one
    @inbounds for i in 1:400
        x[i] += vx[i]*dt
        y[i] += vy[i]*dt
        z[i] += vz[i]*dt
    end
end

function main()
    x = Array{Float64}(undef,500000)
    y = Array{Float64}(undef,500000)
    z = Array{Float64}(undef,500000)
    vx = Array{Float64}(undef,500000)
    vy = Array{Float64}(undef,500000)
    vz = Array{Float64}(undef,500000)
    @time for i in 1:1000
        evolve_bruteforce(0.001,x,y,z,vx,vy,vz)
    end
end

main()

The timings are impressive:

# using julia -O3 main.jl
  0.223773 seconds
# using gcc -Ofast
  Process returned 0 (0x0)   execution time : 0.249 s

My N-body program runs 100x slower when coded in Julia than when coded in C. Why?
