Question

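I have some Python code that I am trying to port to Julia in order to learn this lovely language. I used generators in Python. After porting, it seems to me (at this moment) that Julia is really slow in this area!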

I simplified part of my code down to this exercise:

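Think of a 4x4 chess board. Find every N-move-long path that a chess king could make. In this exercise the king is not allowed to visit the same square twice within one path. Don't waste memory -> make a generator of every path.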

The algorithm is pretty simple:

If we label every position with a number:

0  1  2  3
4  5  6  7
8  9  10 11
12 13 14 15

Point 0 has 3 neighbors (1, 4, 5). We can build a table of the neighbors of every point:

NEIG = [[1, 4, 5], [0, 2, 4, 5, 6], [1, 3, 5, 6, 7], [2, 6, 7], [0, 1, 5, 8, 9], [0, 1, 2, 4, 6, 8, 9, 10], [1, 2, 3, 5, 7, 9, 10, 11], [2, 3, 6, 10, 11], [4, 5, 9, 12, 13], [4, 5, 6, 8, 10, 12, 13, 14], [5, 6, 7, 9, 11, 13, 14, 15], [6, 7, 10, 14, 15], [8, 9, 13], [8, 9, 10, 12, 14], [9, 10, 11, 13, 15], [10, 11, 14]]
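The table does not have to be typed by hand. As a minimal sketch (the helper name `king_neighbors` is made up for this illustration and is not part of the original code), it can be derived from the row-major numbering of the board:

```python
def king_neighbors(size=4):
    """For each square of a size x size board (numbered row by row from 0),
    return the sorted list of squares a king can reach in one move."""
    table = []
    for i in range(size * size):
        row, col = divmod(i, size)
        table.append([
            size * r + c
            for r in range(max(row - 1, 0), min(row + 2, size))
            for c in range(max(col - 1, 0), min(col + 2, size))
            if (r, c) != (row, col)  # a square is not its own neighbor
        ])
    return table

NEIG = king_neighbors()
print(NEIG[0])   # [1, 4, 5]
print(NEIG[15])  # [10, 11, 14]
```

For the 4x4 board this reproduces the hand-written table above, including the last square being number 15.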

PYTHON

A recursive function (generator) which enlarges a given path, taking either a list of points or a generator of (generators of ...) points:

def enlarge(path):
    if isinstance(path, list):
        for i in NEIG[path[-1]]:
            if i not in path:
                yield path[:] + [i]
    else:
        for i in path:
            yield from enlarge(i)

A function (generator) which yields every path of a given length:

def paths(length):
    steps = ([i] for i in range(16))  # first steps on every point on board
    for _ in range(length-1):
        nsteps = enlarge(steps)
        steps = nsteps
    yield from steps

We can see that there are 905776 paths of length 10:

sum(1 for i in paths(10))
Out[89]: 905776

JULIA (this code was created by @gggg during our discussion)

const NEIG_py = [[1, 4, 5], [0, 2, 4, 5, 6], [1, 3, 5, 6, 7], [2, 6, 7], [0, 1, 5, 8, 9], [0, 1, 2, 4, 6, 8, 9, 10], [1, 2, 3, 5, 7, 9, 10, 11], [2, 3, 6, 10, 11], [4, 5, 9, 12, 13], [4, 5, 6, 8, 10, 12, 13, 14], [5, 6, 7, 9, 11, 13, 14, 15], [6, 7, 10, 14, 15], [8, 9, 13], [8, 9, 10, 12, 14], [9, 10, 11, 13, 15], [10, 11, 14]];
const NEIG = [n.+1 for n in NEIG_py]
function enlarge(path::Vector{Int})
    (push!(copy(path),loc) for loc in NEIG[path[end]] if !(loc in path))
end
collect(enlarge([1]))
function enlargepaths(paths)
    Iterators.Flatten(enlarge(path) for path in paths)
end
collect(enlargepaths([[1],[2]]))
function paths(targetlen)
    paths = ([i] for i=1:16)
    for newlen in 2:targetlen
        paths = enlargepaths(paths)
    end
    paths
end
p = sum(1 for path in paths(10))

Benchmarks

In ipython we can time it:

Python 3.6.3:

%timeit sum(1 for i in paths(10))
1.25 s ± 15.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Julia 0.6.0:

julia> @time sum(1 for path in paths(10))
  2.690630 seconds (41.91 M allocations: 1.635 GiB, 11.39% gc time)
905776

Julia 0.7.0-DEV.0:

julia> @time sum(1 for path in paths(10))
  4.951745 seconds (35.69 M allocations: 1.504 GiB, 4.31% gc time)
905776

Question(s):

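We Julians say this: it is important to note that the benchmark codes are not written for absolute maximal performance (the fastest code to compute recursion_fibonacci(20) is the constant literal 6765). Instead, the benchmarks are written to test the performance of the same algorithm and code pattern implemented in each language.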

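In this benchmark we are using the same idea: just simple for loops over arrays, wrapped in generators (nothing from numpy, numba, pandas, or other C-written and compiled Python packages).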

Is the assumption that Julia's generators are terribly slow correct?

What can we do to make it really fast?

Answer 1

const NEIG_py = [[1, 4, 5], [0, 2, 4, 5, 6], [1, 3, 5, 6, 7], [2, 6, 7], [0, 1, 5, 8, 9], [0, 1, 2, 4, 6, 8, 9, 10], [1, 2, 3, 5, 7, 9, 10, 11], [2, 3, 6, 10, 11], [4, 5, 9, 12, 13], [4, 5, 6, 8, 10, 12, 13, 14], [5, 6, 7, 9, 11, 13, 14, 15], [6, 7, 10, 14, 15], [8, 9, 13], [8, 9, 10, 12, 14], [9, 10, 11, 13, 15], [10, 11, 14]];
const NEIG = [n.+1 for n in NEIG_py];

function expandto(n, path, targetlen)
    length(path) >= targetlen && return n+1
    for loc in NEIG[path[end]]
        loc in path && continue
        n = expandto(n, (path..., loc), targetlen)
    end
    n
end

function npaths(targetlen)
    n = 0
    for i = 1:16
        path = (i,)
        n = expandto(n, path, targetlen)
    end
    n
end

Benchmark (after running it once so that JIT compilation is done):

julia> @time npaths(10)
  0.069531 seconds (5 allocations: 176 bytes)
905776

This is quite a bit faster.
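The code above drops the generators altogether: it walks the board depth-first, keeps the current path in an immutable tuple, and only counts finished paths instead of materializing them. For readers following along from the Python side, a rough sketch of the same idea might look like this (0-based squares, so no `.+1` shift; the neighbor table is recomputed here only to keep the snippet self-contained):

```python
# King-move neighbor table for the 4x4 board, squares numbered 0..15.
NEIG = [[4 * r + c
         for r in range(max(i // 4 - 1, 0), min(i // 4 + 2, 4))
         for c in range(max(i % 4 - 1, 0), min(i % 4 + 2, 4))
         if 4 * r + c != i]
        for i in range(16)]

def expandto(path, targetlen):
    """Count the self-avoiding completions of `path` (a tuple) to `targetlen` squares."""
    if len(path) >= targetlen:
        return 1
    return sum(expandto(path + (loc,), targetlen)
               for loc in NEIG[path[-1]] if loc not in path)

def npaths(targetlen):
    """Count all paths of `targetlen` squares, over every starting square."""
    return sum(expandto((start,), targetlen) for start in range(16))

print(npaths(2))  # 84 = 4 corners * 3 + 8 edge squares * 5 + 4 inner squares * 8
```

This is only meant to show the shape of the algorithm; it says nothing about how fast the Python version would run compared with the Julia timing above.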

Answer 2

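Julia's "better performance" than Python isn't magical. Most of it stems directly from the fact that Julia can figure out what the type of each variable within a function will be, and then compile highly specialized code for those specific types. This even applies to the elements of many containers and iterables like generators; Julia often knows ahead of time what type the elements will be. Python isn't able to do this analysis nearly as easily (or at all, in many cases), so its optimizations have focused on improving the dynamic behaviors.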

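In order for Julia's generators to know ahead of time what kinds of types they might produce, they encapsulate information about both the operations they perform and the objects they iterate over in their type: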

julia> (1 for i in 1:16)
Base.Generator{UnitRange{Int64},getfield(Main, Symbol("##27#28"))}(getfield(Main, Symbol("##27#28"))(), 1:16)

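That weird ##27#28 thing is the type of an anonymous function that simply returns 1. By the time the generator gets to LLVM, it knows enough to perform quite a lot of optimizations: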

julia> function naive_sum(c)
           s = 0
           for elt in c
               s += elt
           end
           s
       end
       @code_llvm naive_sum(1 for i in 1:16)

; Function naive_sum
; Location: REPL[1]:2
define i64 @julia_naive_sum_62385({ { i64, i64 } } addrspace(11)* nocapture nonnull readonly dereferenceable(16)) {
top:
; Location: REPL[1]:3
  %1 = getelementptr inbounds { { i64, i64 } }, { { i64, i64 } } addrspace(11)* %0, i64 0, i32 0, i32 0
  %2 = load i64, i64 addrspace(11)* %1, align 8
  %3 = getelementptr inbounds { { i64, i64 } }, { { i64, i64 } } addrspace(11)* %0, i64 0, i32 0, i32 1
  %4 = load i64, i64 addrspace(11)* %3, align 8
  %5 = add i64 %4, 1
  %6 = sub i64 %5, %2
; Location: REPL[1]:6
  ret i64 %6
}

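It may take a minute to parse through the LLVM IR there, but you should be able to see that it is just extracting the endpoints of the UnitRange (getelementptr and load), subtracting them from each other (sub), and adding one to compute the sum without a single loop.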
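To make the arithmetic in that IR concrete, here is a small Python sketch of what the optimized code computes. `closed_form` is a made-up name for this illustration; Python itself performs no such optimization and really does run the loop.

```python
def naive_sum(c):
    """Same shape as the Julia naive_sum: add up every element of c."""
    s = 0
    for elt in c:
        s += elt
    return s

def closed_form(first, last):
    """What the optimized IR does: load both endpoints of the range,
    add one to the last, subtract the first. No loop is involved."""
    return (last + 1) - first

print(naive_sum(1 for i in range(1, 17)))  # 16, computed by looping
print(closed_form(1, 16))                  # 16, computed from the endpoints only
```

Summing the constant 1 over the range `1:16` is just the number of elements in the range, which is why two loads, an add, and a sub are enough.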

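In this case, though, it works against Julia: paths(10) has a ridiculously complicated type! You are iteratively wrapping that one generator in filters and flattens and yet more generators. It becomes so complicated, in fact, that Julia just gives up trying to figure it out and decides to live with the dynamic behavior. At this point it no longer has an inherent advantage over Python; in fact, specializing on so many different types as it recursively walks through the object would be a distinct handicap. You can see this in action by looking at @code_warntype start(1 for i in paths(10)).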

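My rule of thumb for Julia's performance is that type-stable, devectorized code that avoids allocations is typically within 2x of C, and dynamic, unstable, or vectorized code is within an order of magnitude of Python/MATLAB/other higher-level languages. Often it is a bit slower, simply because the other higher-level languages have pushed very hard to optimize their case, whereas the majority of Julia's optimizations have been focused on the type-stable side of things. This deeply nested construct puts you squarely in the dynamic camp.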

So are Julia's generators terribly slow? Not inherently; it is only when they become as deeply nested as this that you hit this bad case.

Answer 3

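This does not follow the same algorithm (and I don't know how fast Python would be doing it this way), but with the following code Julia is basically on par for solutions of length = 10 and considerably faster for length = 16: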

In [48]: %timeit sum(1 for path in paths(10))
1.52 s ± 11.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

julia> @time sum(1 for path in pathsr(10))
  1.566964 seconds (5.54 M allocations: 693.729 MiB, 16.24% gc time)
905776

In [49]: %timeit sum(1 for path in paths(16))
19.3 s ± 15.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

julia> @time sum(1 for path in pathsr(16))
  6.491803 seconds (57.36 M allocations: 9.734 GiB, 33.79% gc time)
343184

Here is the code. I just learned about tasks/channels yesterday, so it can probably be done better:

const NEIG = [[1, 4, 5], [0, 2, 4, 5, 6], [1, 3, 5, 6, 7], [2, 6, 7], [0, 1, 5, 8, 9], [0, 1, 2, 4, 6, 8, 9, 10], [1, 2, 3, 5, 7, 9, 10, 11], [2, 3, 6, 10, 11], [4, 5, 9, 12, 13], [4, 5, 6, 8, 10, 12, 13, 14],
[5, 6, 7, 9, 11, 13, 14, 15], [6, 7, 10, 14, 15], [8, 9, 13], [8, 9, 10, 12, 14], [9, 10, 11, 13, 15], [10, 11, 14]];

function enlarger(num::Int,len::Int,pos::Int,sol::Array{Int64,1},c::Channel)
    if pos == len
        put!(c,copy(sol))
    elseif pos == 0
        for j=0:num
            sol[1]=j
            enlarger(num,len,pos+1,sol,c)
        end
        close(c)
    else
        for i in NEIG[sol[pos]+1]
            if !in(i,sol[1:pos])
                sol[pos+1]=i
                enlarger(num,len,pos+1,sol,c)
            end
        end
    end
end

function pathsr(len)
    c=Channel(0)
    sol = [0 for i=1:len]
    @schedule enlarger(15,len,0,sol,c)
    (i for i in c)
end
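For readers more at home in Python, the task/channel pattern above has a loose analogue in a producer thread feeding a bounded queue. The sketch below is an analogy under stated assumptions, not a translation: `paths_via_queue` and the `DONE` sentinel are invented for this illustration, and a `queue.Queue(maxsize=1)` buffers one item, whereas `Channel(0)` is unbuffered.

```python
import queue
import threading

# King-move neighbor table for the 4x4 board, squares numbered 0..15.
NEIG = [[4 * r + c
         for r in range(max(i // 4 - 1, 0), min(i // 4 + 2, 4))
         for c in range(max(i % 4 - 1, 0), min(i % 4 + 2, 4))
         if 4 * r + c != i]
        for i in range(16)]

def paths_via_queue(length):
    """Yield every self-avoiding king path of `length` squares, produced by a
    background thread and handed over through a bounded queue."""
    q = queue.Queue(maxsize=1)   # producer blocks until the consumer catches up
    DONE = object()              # sentinel that plays the role of close(c)

    def extend(path):
        if len(path) == length:
            q.put(path)
            return
        for loc in NEIG[path[-1]]:
            if loc not in path:
                extend(path + (loc,))

    def producer():
        for start in range(16):
            extend((start,))
        q.put(DONE)

    # daemon=True so an abandoned consumer does not keep the process alive
    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is DONE:
            return
        yield item

print(sum(1 for _ in paths_via_queue(2)))  # 84
```

As in the Julia version, every path crosses a synchronization point between producer and consumer, which is a plausible reason this style costs more than a plain recursive count.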

Answer 4

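Following tholy's answer, since tuples seem to be very fast. This is almost like my previous code, but with the tuple approach, and it gets substantially better results: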

julia> @time sum(1 for i in pathst(10))
  1.155639 seconds (1.83 M allocations: 97.632 MiB, 0.75% gc time)
905776

julia> @time sum(1 for i in pathst(16))
  1.963470 seconds (1.39 M allocations: 147.555 MiB, 0.35% gc time)
343184

The code:

const NEIG = [[1, 4, 5], [0, 2, 4, 5, 6], [1, 3, 5, 6, 7], [2, 6, 7], [0, 1, 5, 8, 9], [0, 1, 2, 4, 6, 8, 9, 10], [1, 2, 3, 5, 7, 9, 10, 11], [2, 3, 6, 10, 11], [4, 5, 9, 12, 13], [4, 5, 6, 8, 10, 12, 13, 14], [5, 6, 7, 9, 11, 13, 14, 15], [6, 7, 10, 14, 15], [8, 9, 13], [8, 9, 10, 12, 14], [9, 10, 11, 13, 15], [10, 11, 14]];

function enlarget(path,len,c::Channel)
    if length(path) >= len
        put!(c,path)
    else
        for loc in NEIG[path[end]+1]
            loc in path && continue
            enlarget((path..., loc), len,c)
        end
        if length(path) == 1
            path[1] == 15 ? close(c) : enlarget((path[1]+1,),len,c)
        end
    end
end

function pathst(len)
    c=Channel(0)
    path=(0,)
    @schedule enlarget(path,len,c)
    (i for i in c)
end

Answer 5

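Since everyone is writing an answer... here is another version, this time using Iterators, which are in a way more idiomatic than generators in current Julia (0.6.1). Iterators provide many of the benefits that generators have. The iterator is defined as follows: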

import Base.Iterators: start, next, done, eltype, iteratoreltype, iteratorsize

struct SAWsIterator
    neigh::Vector{Vector{Int}}
    pathlen::Int
    pos::Int
end

SAWs(neigh, pathlen, pos) = SAWsIterator(neigh, pathlen, pos)

start(itr::SAWsIterator) = 
    ([itr.pos ; zeros(Int, itr.pathlen-1)], Vector{Int}(itr.pathlen-1),
     2, Ref{Bool}(false), Ref{Bool}(false))

@inline next(itr::SAWsIterator, s) = 
    ( s[4][] ? s[4][] = false : calc_next!(itr, s) ; 
      (s[1], (s[1], s[2], itr.pathlen, s[4], s[5])) )

@inline done(itr::SAWsIterator, s) = ( s[4][] || calc_next!(itr, s) ; s[5][] )

function calc_next!(itr::SAWsIterator, s)
    s[4][] = true ; s[5][] = false
    curindex = s[3]
    pathlength = itr.pathlen
    path, options = s[1], s[2]
    neigh = itr.neigh  # use the iterator's own neighbor table rather than a global
    @inbounds while curindex<=pathlength
        curindex == 1 && ( s[5][] = true ; break )
        startindex = path[curindex] == 0 ? 1 : options[curindex-1]+1
        path[curindex] = 0
        i = findnext(x->!(x in path), neigh[path[curindex-1]], startindex)
        if i==0
            path[curindex] = 0 ; options[curindex-1] = 0 ; curindex -= 1
        else
            path[curindex] = neigh[path[curindex-1]][i]
            options[curindex-1] = i ; curindex += 1
        end
    end
    return nothing
end

eltype(::Type{SAWsIterator}) = Vector{Int}
iteratoreltype(::Type{SAWsIterator}) = Base.HasEltype()
iteratorsize(::Type{SAWsIterator}) = Base.SizeUnknown()

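Cutting and pasting the definitions above works. The term SAW is used as an acronym for Self Avoiding Walk, a name sometimes used in mathematics for such a path.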

Now, to use/test this iterator, the following code can be executed:

allSAWs(neigh, pathlen) = 
  Base.Flatten(SAWs(neigh,pathlen,k) for k in eachindex(neigh))

iterlength(itr) = mapfoldl(x->1, +, 0, itr)

using Base.Test

const neigh = [[2, 5, 6], [1, 3, 5, 6, 7], [2, 4, 6, 7, 8], [3, 7, 8], 
  [1, 2, 6, 9, 10], [1, 2, 3, 5, 7, 9, 10, 11], [2, 3, 4, 6, 8, 10, 11, 12], 
  [3, 4, 7, 11, 12], [5, 6, 10, 13, 14], [5, 6, 7, 9, 11, 13, 14, 15], 
  [6, 7, 8, 10, 12, 14, 15, 16], [7, 8, 11, 15, 16], [9, 10, 14], 
  [9, 10, 11, 13, 15], [10, 11, 12, 14, 16], [11, 12, 15]]

@test iterlength(allSAWs(neigh, 10)) == 905776

for (i,path) in enumerate(allSAWs(neigh, 10))
    if i % 100_000 == 0
        @show i,path
    end
end

@time iterlength(allSAWs(neigh, 10))

It is relatively readable, and the output looks like this:

(i, path) = (100000, [2, 5, 10, 14, 9, 6, 7, 12, 15, 11])
(i, path) = (200000, [4, 3, 8, 7, 6, 10, 14, 11, 16, 15])
(i, path) = (300000, [5, 10, 11, 16, 15, 14, 9, 6, 7, 3])
(i, path) = (400000, [8, 3, 6, 5, 2, 7, 11, 14, 15, 10])
(i, path) = (500000, [9, 14, 10, 5, 2, 3, 8, 11, 6, 7])
(i, path) = (600000, [11, 16, 15, 14, 10, 6, 3, 8, 7, 12])
(i, path) = (700000, [13, 10, 15, 16, 11, 6, 2, 1, 5, 9])
(i, path) = (800000, [15, 11, 12, 7, 2, 3, 6, 1, 5, 9])
(i, path) = (900000, [16, 15, 14, 9, 5, 10, 7, 8, 12, 11])
  0.130755 seconds (4.16 M allocations: 104.947 MiB, 11.37% gc time)
905776

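0.13s is not too bad, given that this is not as optimized as @tholy's answer or some of the others. Some tricks used in other answers are deliberately not used here, specifically: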

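Recursion, which basically uses the stack as a quick way to allocate.
Tuples combined with specialization, which hide some run-time complexity in the first compilation of methods for each tuple signature.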

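An optimization not yet seen in any answer, and one that could be important, is using an efficient Bool array or Dict to speed up the check of whether a vertex has already been used in the path. In this answer, findnext triggers an allocation, which can be avoided, and then this answer would be closer to the minimal number of memory allocations.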
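A rough Python sketch of that idea, combined with the explicit (non-recursive) backtracking state used by the iterator above, could look like the following. The class name `SelfAvoidingWalks` and the `used`/`option` arrays are assumptions made for this illustration; squares are numbered from 0 here, and nothing is claimed about how this would perform in Julia.

```python
# King-move neighbor table for the 4x4 board, squares numbered 0..15.
NEIG = [[4 * r + c
         for r in range(max(i // 4 - 1, 0), min(i // 4 + 2, 4))
         for c in range(max(i % 4 - 1, 0), min(i % 4 + 2, 4))
         if 4 * r + c != i]
        for i in range(16)]

class SelfAvoidingWalks:
    """Iterate over all self-avoiding walks of `pathlen` squares from `start`."""

    def __init__(self, neigh, pathlen, start):
        self.neigh, self.pathlen, self.start = neigh, pathlen, start

    def __iter__(self):
        neigh, pathlen = self.neigh, self.pathlen
        path = [self.start]
        used = [False] * len(neigh)  # used[v] is True while v is on the path
        used[self.start] = True
        option = [0]                 # option[d]: next neighbor index to try at depth d
        while path:
            if len(path) == pathlen:
                yield tuple(path)            # hand out a copy, then backtrack
                used[path.pop()] = False
                option.pop()
                continue
            candidates = neigh[path[-1]]
            k = option[-1]
            while k < len(candidates) and used[candidates[k]]:
                k += 1                       # O(1) membership test per candidate
            if k == len(candidates):         # dead end: undo the last step
                used[path.pop()] = False
                option.pop()
            else:                            # take the step and remember where to resume
                option[-1] = k + 1
                path.append(candidates[k])
                used[candidates[k]] = True
                option.append(0)

total = sum(1 for s in range(16) for _ in SelfAvoidingWalks(NEIG, 2, s))
print(total)  # 84
```

The `used` array replaces the linear `x in path` scan with a single indexed lookup, at the cost of keeping it in sync whenever a square is pushed onto or popped off the path.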

Answer 6

Here is my quick and dirty cheating experiment (I promised in a comment to add it here), in which I try to speed up Angel's code:

const NEIG_py = [[1, 4, 5], [0, 2, 4, 5, 6], [1, 3, 5, 6, 7], [2, 6, 7], [0, 1, 5, 8, 9], [0, 1, 2, 4, 6, 8, 9, 10], [1, 2, 3, 5, 7, 9, 10, 11], [2, 3, 6, 10, 11], [4, 5, 9, 12, 13], [4, 5, 6, 8, 10, 12, 13, 14], [5, 6, 7, 9, 11, 13, 14, 15], [6, 7, 10, 14, 15], [8, 9, 13], [8, 9, 10, 12, 14], [9, 10, 11, 13, 15], [10, 11, 14]];
const NEIG = [n.+1 for n in NEIG_py]

function enlargetc(path,len,c::Function)
    if length(path) >= len
        c(path)
    else
        for loc in NEIG[path[end]]
            loc in path && continue
            enlargetc((path..., loc), len,c)
        end
        if length(path) == 1
            if path[1] == 16
                return
            else
                enlargetc((path[1]+1,), len, c)
            end
        end
    end
end

function get_counter()
    let helper = 0
        function f(a)
            helper += 1
            return helper
        end
        return f
    end
end

counter = get_counter()
@time enlargetc((1,), 10, counter)  # 0.481986 seconds (2.62 M allocations: 154.576 MiB, 5.12% gc time)
counter.helper.contents  # 905776

Edit: the time in the code comment was measured without prior compilation! After compilation it is 0.201669 seconds (2.53 M allocations: 150.036 MiB, 10.77% gc time).
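The "cheat" is that the channel is replaced by a plain callback: each finished path is passed to a function, here a closure that only counts. A minimal Python sketch of the same pattern (the names `enlarge_cb` and `make_counter` are invented for this illustration; squares are numbered from 0):

```python
# King-move neighbor table for the 4x4 board, squares numbered 0..15.
NEIG = [[4 * r + c
         for r in range(max(i // 4 - 1, 0), min(i // 4 + 2, 4))
         for c in range(max(i % 4 - 1, 0), min(i % 4 + 2, 4))
         if 4 * r + c != i]
        for i in range(16)]

def enlarge_cb(path, length, callback):
    """Depth-first search that calls `callback(path)` for every finished path."""
    if len(path) >= length:
        callback(path)
        return
    for loc in NEIG[path[-1]]:
        if loc not in path:
            enlarge_cb(path + (loc,), length, callback)

def make_counter():
    """Return a counting callback plus a function that reads the count."""
    count = 0

    def bump(_path):
        nonlocal count   # the closure owns the counter, like `helper` above
        count += 1
        return count

    def total():
        return count

    return bump, total

bump, total = make_counter()
for start in range(16):
    enlarge_cb((start,), 2, bump)
print(total())  # 84
```

Because the consumer runs inside the producer's own call stack, there is no task switch per path, which fits the drop from roughly 1.16 s for the channel version to about 0.2 s reported above.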

Julia - the way of the king (generator performance)