Question

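Consider the following 4 functions in Julia: they all pick or compute a random column of a matrix A and add a constant multiple of that column to a vector z.

The difference between slow1 and fast1 is how z is updated, and likewise for slow2 and fast2.

The difference between the 1 functions and the 2 functions is whether the matrix A is passed to the function or its columns are computed on the fly.

The odd thing is that for the 1 functions, fast1 is faster (as I would expect when using BLAS instead of +=), but for the 2 functions slow2 is faster. On this computer I get the following timings (for the second run of each function):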

@time slow1(A, z, 10000);
0.172560 seconds (110.01 k allocations: 940.102 MB, 12.98% gc time)

@time fast1(A, z, 10000);
0.142748 seconds (50.07 k allocations: 313.577 MB, 4.56% gc time)

@time slow2(complex(float(x)), complex(float(y)), z, 10000);
2.265950 seconds (120.01 k allocations: 1.529 GB, 1.20% gc time)

@time fast2(complex(float(x)), complex(float(y)), z, 10000);
4.351953 seconds (60.01 k allocations: 939.410 MB, 0.43% gc time)

Is there an explanation for this behavior? And is there a way to make BLAS faster than +=?

M = 2^10                                                                                                             
x = [-M:M-1;]

N = 2^9 
y = [-N:N-1;]

A = cis( -2*pi*x*y' )
z = rand(2*M) + rand(2*M)*im

function slow1(A::Matrix{Complex{Float64}}, z::Vector{Complex{Float64}}, maxiter::Int)
    S = [1:size(A,2);]

    for iter = 1:maxiter
        idx = rand(S)
        col = A[:,idx]
        a = rand()
        z += a*col
    end 
end

function fast1(A::Matrix{Complex{Float64}}, z::Vector{Complex{Float64}}, maxiter::Int)
    S = [1:size(A,2);]

    for iter = 1:maxiter
        idx = rand(S)
        col = A[:,idx]
        a = rand()
        BLAS.axpy!(a, col, z)
    end 
end

function slow2(x::Vector{Complex{Float64}}, y::Vector{Complex{Float64}}, z::Vector{Complex{Float64}}, maxiter::Int)
    S = [1:length(y);]

    for iter = 1:maxiter
        idx = rand(S)
        col = cis( -2*pi*x*y[idx] )
        a = rand()
        z += a*col
    end
end

function fast2(x::Vector{Complex{Float64}}, y::Vector{Complex{Float64}}, z::Vector{Complex{Float64}}, maxiter::Int)
    S = [1:length(y);]

    for iter = 1:maxiter
        idx = rand(S)
        col = cis( -2*pi*x*y[idx] )
        a = rand()
        BLAS.axpy!(a, col, z)
    end
end
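The update styles used above differ in more than speed. A minimal sketch of that difference, assuming Julia 1.x (where BLAS lives in the LinearAlgebra standard library); the function names update_rebind, update_fused! and update_blas! are hypothetical and only serve the illustration:

```julia
using LinearAlgebra  # provides BLAS.axpy! in Julia 1.x

# `z += a * col` means `z = z + a * col`: it allocates a new array and
# rebinds the local name, so the caller's vector is left unchanged.
function update_rebind(z::Vector{ComplexF64}, col::Vector{ComplexF64}, a::Float64)
    z += a * col
    return z
end

# Fused broadcast: writes into the existing array without temporaries.
function update_fused!(z::Vector{ComplexF64}, col::Vector{ComplexF64}, a::Float64)
    z .+= a .* col
    return z
end

# BLAS call: overwrites z with a*col + z.
function update_blas!(z::Vector{ComplexF64}, col::Vector{ComplexF64}, a::Float64)
    BLAS.axpy!(a, col, z)
    return z
end

col = ComplexF64[1, 1, 1]
z1 = ComplexF64[1, 2, 3]
z2 = copy(z1)
z3 = copy(z1)

r = update_rebind(z1, col, 2.0)   # r is a new array; z1 is still [1, 2, 3]
update_fused!(z2, col, 2.0)       # z2 is modified in place
update_blas!(z3, col, 2.0)        # z3 is modified in place
```

Only the last two variants change the vector that the caller passed in, which is worth keeping in mind when comparing allocation counts between the slow and fast functions.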

Update: profiling slow2:

2260 task.jl; anonymous; line: 92
 2260 REPL.jl; eval_user_input; line: 63
  2260 profile.jl; anonymous; line: 16
   2175 /tmp/axpy.jl; slow2; line: 37
    10   arraymath.jl; .*; line: 118
    33   arraymath.jl; .*; line: 120
    5    arraymath.jl; .*; line: 125
    46   arraymath.jl; .*; line: 127
    3    complex.jl; cis; line: 286
    3    complex.jl; cis; line: 287
    2066 operators.jl; cis; line: 374
     72   complex.jl; cis; line: 286
     1914 complex.jl; cis; line: 287
   1    /tmp/axpy.jl; slow2; line: 38
   84   /tmp/axpy.jl; slow2; line: 39
    5  arraymath.jl; +; line: 96
    39 arraymath.jl; +; line: 98
    6  arraymath.jl; .*; line: 118
    34 arraymath.jl; .*; line: 120

Profiling fast2:

4288 task.jl; anonymous; line: 92
 4288 REPL.jl; eval_user_input; line: 63
  4288 profile.jl; anonymous; line: 16
   1    /tmp/axpy.jl; fast2; line: 47
    1 random.jl; rand; line: 214
   3537 /tmp/axpy.jl; fast2; line: 48
    26   arraymath.jl; .*; line: 118
    44   arraymath.jl; .*; line: 120
    1    arraymath.jl; .*; line: 122
    4    arraymath.jl; .*; line: 125
    53   arraymath.jl; .*; line: 127
    7    complex.jl; cis; line: 286
    3399 operators.jl; cis; line: 374
     116  complex.jl; cis; line: 286
     3108 complex.jl; cis; line: 287
   2    /tmp/axpy.jl; fast2; line: 49
   748  /tmp/axpy.jl; fast2; line: 50
    748 linalg/blas.jl; axpy!; line: 231

Oddly, the time spent computing col differs even though the two functions are identical up to that point. But += is still faster than axpy!.
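Traces like the two above come from Julia's sampling profiler. A minimal sketch of collecting one, assuming Julia 1.x (where the profiler is the Profile standard library); make_col and work are hypothetical stand-ins for the column computation and loop of slow2, not the original code:

```julia
using Profile

# Hypothetical stand-in for the column computation in slow2/fast2,
# written with Julia 1.x dot-broadcast syntax.
function make_col(x::Vector{ComplexF64}, yi::ComplexF64)
    return cis.(-2pi .* x .* yi)
end

# Hypothetical driver that repeats the computation n times.
function work(x::Vector{ComplexF64}, yi::ComplexF64, n::Int)
    s = zero(ComplexF64)
    for _ in 1:n
        s += sum(make_col(x, yi))
    end
    return s
end

x = complex(float(collect(-1024:1023)))

work(x, 1.0 + 0.0im, 1)             # run once so compilation is not profiled
Profile.clear()                     # drop samples from earlier runs
@profile work(x, 1.0 + 0.0im, 200)  # collect samples
Profile.print(maxdepth = 12)        # print the call tree with sample counts
```

Each number in the printed tree is a sample count, so comparing the counts under the same source line across two runs is how the difference in the cis line was spotted above.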

Answer 1

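More information is available now that Julia 0.6 is out. To multiply a vector by a scalar, there are at least four options. Following Tim's suggestion, I used the @btime macro from BenchmarkTools. It turns out that loop fusion, the most Julian way to write it, is on par with calling BLAS. That is something the Julia developers can be proud of!

using BenchmarkTools
function bmark(N)
    a = zeros(N);
    @btime $a *= -1.;
    @btime $a .*= -1.;
    @btime LinAlg.BLAS.scal!($N, -1.0, $a, 1);
    @btime scale!($a, -1.);
end

The results for 10^5 numbers:

julia> bmark(10^5);
  78.195 μs (2 allocations: 781.33 KiB)
  35.102 μs (0 allocations: 0 bytes)
  34.659 μs (0 allocations: 0 bytes)
  34.664 μs (0 allocations: 0 bytes)

The profiling backtrace shows that scale! is just calling BLAS in the background, so the two should give the same best time.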
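The benchmark above uses Julia 0.6 names. As a rough sketch of the same four options in Julia 1.x, assuming the LinearAlgebra standard library (LinAlg became LinearAlgebra, and scale! was replaced by rmul!), the following only checks that all four produce the same values and makes no timing claim:

```julia
using LinearAlgebra

N = 10^5
a1 = ones(N)
a2 = ones(N)
a3 = ones(N)
a4 = ones(N)

a1 *= -1.0                    # allocates a new array and rebinds a1
a2 .*= -1.0                   # fused broadcast, in place
BLAS.scal!(N, -1.0, a3, 1)    # direct BLAS call, in place
rmul!(a4, -1.0)               # Julia 1.x replacement for scale!, in place

same = (a1 == a2 == a3 == a4)
```

To time them, each line can be wrapped in @btime from the BenchmarkTools package, interpolating the array with $ as in the original benchmark.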

BLAS.axpy! is slower than += in Julia
