Question

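Julia for loops are said to be as fast as vectorized operations, or even faster when used properly. I have two pieces of code. The idea is to compute a sample statistic for a given 0-1 sequence x. In these two examples I am only computing a sum, but there are more complicated cases, and I am trying to understand the general performance pitfalls in my code. The first version looks like this: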

S = 2 * sum(x) - n
s_obs_new = abs(S) / sqrt(2 * n)
pval = erfc(s_obs_new)

The second is the "naive", classic version:

S = 0
for i in eachindex(x)
    S += x[i]
end
S = 2 * S - n
s_obs_new = abs(S) / sqrt(2 * n)
pval = erfc(s_obs_new)

Using @benchmark, I found that the first example runs in about 11.8 ms, while the second takes 38 ms.

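This example matters a lot to me because there are many other places where vectorization is not possible, so I want code written in the devectorized "manner" to be as fast as the vectorized version.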

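Any ideas why the devectorized code is more than 3 times slower than the vectorized code? Type stability is fine, there are no unnecessary large memory allocations, and so on.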

The code for the first function is:

using SpecialFunctions  # provides erfc on recent Julia versions

function frequency_monobit_test1(x::Array{Int8, 1}, n = 0)
    # Count the 1s and 0s in the sequence; we want the number
    # of 1s and 0s to be approximately the same.
    # Recommendation: n >= 100
    # Decision rule (at the 1% level): if pval < 0.01 -> non-random
    if n == 0
        n = length(x)
    end
    S = 2 * sum(x) - n
    s_obs_new = abs(S) / sqrt(2 * n)
    pval = erfc(s_obs_new)
    return pval
end

The second is:

function frequency_monobit_test2(x::Array{Int8, 1}, n = 0)
    # Count the 1s and 0s in the sequence; we want the number
    # of 1s and 0s to be approximately the same.
    # Recommendation: n >= 100
    # Decision rule (at the 1% level): if pval < 0.01 -> non-random
    if n == 0
        n = length(x)
    end
    S = 0  # Int64 accumulator on a 64-bit machine
    @inbounds for i in eachindex(x)
        S += x[i]
    end
    S = 2 * S - n
    s_obs_new = abs(S) / sqrt(2 * n)
    pval = erfc(s_obs_new)
    return pval
end

Answer 1

This is a curious case. There seems to be a performance problem when accumulating Int8 values into an Int64 variable.

Let's try these functions:

using SpecialFunctions, BenchmarkTools, Random  # Random provides bitrand on Julia 0.7 and later

function frequency_monobit_test1(x, n=length(x))
    S = sum(x)
    return erfc(abs(2S - n) / sqrt(2n))
end

function frequency_monobit_test3(typ::Type{<:Integer}, x, n=length(x))
    S = zero(typ)
    for i in eachindex(x)
        @inbounds S += x[i]
    end
    return erfc(abs(2S - n) / sqrt(2n))
end

Initialize some vectors:

N = 2^25;
x64 = rand(0:1, N);
x32 = rand(Int32[0, 1], N);  # assumed definition: x32 is benchmarked below but missing from this listing
x8 = rand(Int8[0, 1], N);
xB = rand(Bool, N);
xb = bitrand(N);

Benchmarks:

For Int64 input:

julia> @btime frequency_monobit_test1($x64)
  17.540 ms (0 allocations: 0 bytes)
0.10302739916042186

julia> @btime frequency_monobit_test3(Int64, $x64)
  17.796 ms (0 allocations: 0 bytes)
0.10302739916042186

julia> @btime frequency_monobit_test3(Int32, $x64)
  892.715 ms (67106751 allocations: 1023.97 MiB)
0.10302739916042186

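We see that sum and the explicit loop are equally fast, and that initializing the accumulator as Int32 is a bad idea: it is about 50 times slower and allocates roughly 1 GiB.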
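The allocations in the Int32 case are consistent with a type-unstable accumulator: Julia promotes Int32 + Int64 to Int64, so S starts out as Int32 and changes type after the first addition. A minimal sketch of that promotion follows; the name `unstable_sum` is made up for illustration, and newer Julia versions may handle such a small type union without allocating.

```julia
# Adding an Int32 and an Int64 promotes the result to Int64.
@assert typeof(zero(Int32) + one(Int64)) == Int64

# Hypothetical helper mirroring the loop above: the accumulator starts as `typ`
# but takes on the promoted type after the first addition.
function unstable_sum(typ::Type{<:Integer}, x)
    S = zero(typ)
    for i in eachindex(x)
        @inbounds S += x[i]
    end
    return S
end

x64 = Int64[0, 1, 1, 0, 1]

@assert typeof(unstable_sum(Int32, x64)) == Int64   # accumulator changed type
@assert typeof(unstable_sum(Int64, x64)) == Int64   # accumulator kept its type
```

Running `@code_warntype unstable_sum(Int32, x64)` in the REPL is one way to see the accumulator reported as a union of the two integer types.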

For Int32 input:

julia> @btime frequency_monobit_test1($x32)
  9.137 ms (0 allocations: 0 bytes)
0.2386386867682374

julia> @btime frequency_monobit_test3(Int64, $x32)
  8.839 ms (0 allocations: 0 bytes)
0.2386386867682374

julia> @btime frequency_monobit_test3(Int32, $x32)
  7.274 ms (0 allocations: 0 bytes)
0.2386386867682374

sum and the loop have similar speed. Accumulating into Int32 saves a bit of time.

For Int8 input:

julia> @btime frequency_monobit_test1($x8)
  5.681 ms (0 allocations: 0 bytes)
0.16482999123032094

julia> @btime frequency_monobit_test3(Int64, $x8)
  19.517 ms (0 allocations: 0 bytes)
0.16482999123032094

julia> @btime frequency_monobit_test3(Int32, $x8)
  4.815 ms (0 allocations: 0 bytes)
0.16482999123032094

The explicit loop is slightly faster when accumulating into Int32, but holy cow! What happened with Int64? That is really slow!
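One way to look into a slowdown like this is to compare the LLVM IR generated for the two accumulator types. The sketch below only shows how to capture and inspect the IR; `loop_sum` is a hypothetical stand-in for the loop above, and the check for vector types is a rough heuristic whose result depends on the Julia version and CPU, so no particular outcome is claimed.

```julia
using InteractiveUtils  # provides code_llvm (loaded automatically in the REPL)

# Hypothetical stand-in for the explicit loop benchmarked above.
function loop_sum(typ::Type{<:Integer}, x)
    S = zero(typ)
    for i in eachindex(x)
        @inbounds S += x[i]
    end
    return S
end

# Capture the LLVM IR for each accumulator type as a string.
ir64 = sprint(code_llvm, loop_sum, Tuple{Type{Int64}, Vector{Int8}})
ir32 = sprint(code_llvm, loop_sum, Tuple{Type{Int32}, Vector{Int8}})

# Vectorized loops typically contain vector types such as "<8 x i32>" in the IR.
println("Int64 accumulator uses vector types: ", occursin(r"<\d+ x i\d+>", ir64))
println("Int32 accumulator uses vector types: ", occursin(r"<\d+ x i\d+>", ir32))
```

If the two differ, reading the full IR (or the output of `@code_native`) for the slower variant is the next step.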

What about Bool?

julia> @btime frequency_monobit_test1($xB)
  9.627 ms (0 allocations: 0 bytes)
0.7728544347518309

julia> @btime frequency_monobit_test3(Int64, $xB)
  9.629 ms (0 allocations: 0 bytes)
0.7728544347518309

julia> @btime frequency_monobit_test3(Int32, $xB)
  4.815 ms (0 allocations: 0 bytes)
0.7728544347518309

The loop and sum have the same speed, but accumulating into Int32 halves the time.

Now let's try a BitArray:

julia> @btime frequency_monobit_test1($xb)
  259.044 μs (0 allocations: 0 bytes)
0.7002576522570715

julia> @btime frequency_monobit_test3(Int64, $xb)
  19.423 ms (0 allocations: 0 bytes)
0.7002576522570715

julia> @btime frequency_monobit_test3(Int32, $xb)
  19.430 ms (0 allocations: 0 bytes)
0.7002576522570715

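So sum on a BitArray is extremely fast, because it can add the bits up a chunk at a time, whereas extracting individual elements in a loop incurs some overhead.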
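To illustrate what adding in chunks can look like, here is a minimal sketch that counts set bits one 64-bit word at a time and compares the result with an element-by-element loop. It relies on `chunks`, an internal field of BitArray rather than public API, and it is not a claim about how Base implements sum; `loop_count` is a name made up for the example.

```julia
using Random  # bitrand lives in Random on Julia 0.7 and later

xb = bitrand(1_000)

# `chunks` is an internal Vector{UInt64}; each word packs 64 elements,
# and unused trailing bits are kept at zero.
chunked = sum(count_ones, xb.chunks)

# Element-by-element loop: one bit is extracted per iteration.
function loop_count(b)
    S = 0
    for i in eachindex(b)
        @inbounds S += b[i]
    end
    return S
end

@assert chunked == loop_count(xb) == sum(xb)
```

The chunked version touches 16 words for 1,000 elements, while the loop performs 1,000 individual bit extractions.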

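Conclusions:

With a loop you can get performance similar to or better than sum, but BitArray is a very special case.
If you know the length of your array and know that Int32 is enough to hold your sum, accumulating into Int32 saves time.
Something strange happens when accumulating Int8 values into an Int64. I don't know why the performance is so bad.
If you are only interested in zeros and ones, use an array of Bool rather than an array of Int8, and probably accumulate into Int32.
sum on a BitArray can be extremely fast in some cases.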

sum(::Vector{Bool}) is strangely slow: in the benchmarks above it takes about 1.7 times as long as sum(::Vector{Int8}) (9.6 ms versus 5.7 ms).
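The point about knowing the array length can be made concrete: for 0/1 data the sum never exceeds the number of elements, so an Int32 accumulator is safe whenever the length fits in an Int32. The following is a sketch under that assumption; `sum_with` and `count_set` are hypothetical names, and no speedup is claimed beyond what the benchmarks above show.

```julia
# Hypothetical helper: sum a vector with a caller-chosen accumulator type.
function sum_with(::Type{T}, x) where {T<:Integer}
    S = zero(T)
    @inbounds for i in eachindex(x)
        S += T(x[i])    # convert first so that S keeps type T
    end
    return S
end

# Assumes x contains only zeros and ones, so the sum is at most length(x).
function count_set(x::AbstractVector{<:Integer})
    if length(x) <= typemax(Int32)
        return Int(sum_with(Int32, x))   # Int32 cannot overflow here
    else
        return sum_with(Int, x)
    end
end

x = rand(Bool, 10_000)
@assert count_set(x) == sum(x)
```

Converting each element to the accumulator type before adding keeps the loop type-stable even for Int64 input, which avoids the promotion problem seen with frequency_monobit_test3(Int32, x64).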

Julia: how to beat vectorized operations?
