Question

解决了最近的Advent of Code problem，我发现我的默认Python比PyPy慢了约40倍。通过限制对len的调用并通过在函数中运行来限制全局查找，我能够使用this code将其降低到大约17倍。

现在，e.py在python 3.6.3上运行5.162秒，在我的机器上运行PyPy时运行.297秒。

我的问题是：这是JIT的不可缩短的加速，还是有某种方法可以加速CPython的回答？（没有极端的意思：我可以去Cython / Numba或其他什么？）我如何说服自己，我无能为力？

请参阅要点以获取数字输入文件列表。

如the problem statement所述，它们代表跳跃偏移。 position += offsets[current]，并将当前偏移量增加1.当跳转将您带到列表之外时，您就完成了。

这是给出的示例（需要5秒的完整输入更长，并且数字更大）：

(0) 3  0  1  -3  - before we have taken any steps.
(1) 3  0  1  -3  - jump with offset 0 (that is, don't jump at all). Fortunately, the instruction is then incremented to 1.
 2 (3) 0  1  -3  - step forward because of the instruction we just modified. The first instruction is incremented again, now to 2.
 2  4  0  1 (-3) - jump all the way to the end; leave a 4 behind.
 2 (4) 0  1  -2  - go back to where we just were; increment -3 to -2.
 2  5  0  1  -2  - jump 4 steps forward, escaping the maze.

代码：

def run(cmds):
    location = 0
    counter = 0
    while 1:
        try:
            cmd = cmds[location]
            if cmd >= 3:
                cmds[location] -= 1
            else:
                cmds[location] += 1
            location += cmd
            if location < 0:
                print(counter)
                break
            counter += 1
        except:
            print(counter)
            break

if __name__=="__main__":
    text = open("input.txt").read().strip().split("\n")
    cmds = [int(cmd) for cmd in text]
    run(cmds)

编辑：我使用Cython编译并运行代码，将运行时间降至2.53秒，但这仍然比PyPy慢了几个数量级。

编辑：Numba gets me to within 2x

编辑：最好的Cython I could write降至1.32秒，略高于4倍的pypy

编辑：按照@viraptor的建议将cmd变量移动到cdef中，使Cython降低到.157秒！几乎快一整个数量级，而不是远离常规python。尽管如此，我还是对PyPy JIT给予了极大的印象，它完全免费完成了这一切！

Answer 1

作为Python的基线，我用C编写了这个（然后决定使用C ++来获得更方便的数组I / O）。它使用clang ++有效地编译x86-64。使用Skylake x86 时，问题中的代码运行速度比CPython3.6.2快82倍，因此即使是速度更快的Python版本，仍然有几个因素可以跟上接近最佳的机器代码。（是的，我查看了编译器的asm输出以检查它是否做得很好）。

让一个好的JIT或者提前编译器看到循环逻辑是这里性能的关键。问题逻辑本质上是串行的，所以没有空间让Python运行已经编译C在整个数组上做一些事情（比如NumPy），因为除非你使用Cython或其他东西，否则不会为这个特定问题编译C.让问题的每一步都回到CPython解释器就是性能的死亡，当它的缓慢没有被内存瓶颈或任何东西隐藏时。

更新：将偏移数组转换为指针数组，将其加速另一个因子1.5倍（简单寻址模式+从关键路径循环中移除add依赖链，将其降低到4 cycle L1D load-use latency的简单寻址模式（when the pointer comes from another load），而不是6c = 5c + 1c，用于索引寻址模式+ add延迟）。

但我认为我们可以对Python慷慨而不期望它能跟上算法转换以适应现代CPU：P（特别是因为我使用32位指针即使在64位模式下也能确保4585元素阵列仍然只有18kiB所以它适合32kiB L1D缓存。就像Linux x32 ABI那样，或AArch64 ILP32 ABI。）

此外，更新的替代表达式使gcc编译为无分支，就像clang一样。（在这个答案中留下了注释掉的原始perf stat输出，因为看到无网点与错误预测的影响很有意思。）

unsigned jumps(int offset[], unsigned size) {
    unsigned location = 0;
    unsigned counter = 0;

    do {
          //location += offset[location]++;            // simple version
          // >=3 conditional version below

        int off = offset[location];

        offset[location] += (off>=3) ? -1 : 1;       // branchy with gcc
        // offset[location] = (off>=3) ? off-1 : off+1;  // branchless with gcc and clang.  

        location += off;

        counter++;
    } while (location < size);

    return counter;
}

#include <iostream>
#include <iterator>
#include <vector>

int main()
{
    std::ios::sync_with_stdio(false);     // makes cin faster
    std::istream_iterator<int> begin(std::cin), dummy;
    std::vector<int> values(begin, dummy);   // construct a dynamic array from reading stdin

    unsigned count = jumps(values.data(), values.size());
    std::cout << count << '\n';
}

使用clang4.0.1 -O3 -march=skylake，内部循环是无分支的;它对>=3条件使用条件移动。我使用? :，因为这是我希望编译器能做的。 Source + asm on the Godbolt compiler explorer

.LBB1_4:                                # =>This Inner Loop Header: Depth=1
    mov     ebx, edi               ; silly compiler: extra work inside the loop to save code outside
    mov     esi, dword ptr [rax + 4*rbx]  ; off = offset[location]
    cmp     esi, 2
    mov     ecx, 1
    cmovg   ecx, r8d               ; ecx = (off>=3) ? -1 : 1;  // r8d = -1 (set outside the loop)
    add     ecx, esi               ; off += -1 or 1
    mov     dword ptr [rax + 4*rbx], ecx  ; store back the updated off
    add     edi, esi               ; location += off  (original value)
    add     edx, 1                 ; counter++
    cmp     edi, r9d
    jb      .LBB1_4                ; unsigned compare against array size

这是我的i7-6700k Skylake CPU上perf stat ./a.out < input.txt（对于clang版本）的输出：

21841249        # correct total, matches Python

 Performance counter stats for './a.out':

         36.843436      task-clock (msec)         #    0.997 CPUs utilized          
                 0      context-switches          #    0.000 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
               119      page-faults               #    0.003 M/sec                  
       143,680,934      cycles                    #    3.900 GHz                    
       245,059,492      instructions              #    1.71  insn per cycle         
        22,654,670      branches                  #  614.890 M/sec                  
            20,171      branch-misses             #    0.09% of all branches        

       0.036953258 seconds time elapsed

由于循环中的数据依赖性，每时钟的平均指令远低于4。下一次迭代的加载地址取决于此迭代的加载+添加，而无序执行无法隐藏。但是，它可以重叠更新当前位置值的所有工作。

从int更改为short没有性能下降（正如预期的那样; movsx has the same latency as mov on Skylake），但内存消耗减半，因此如果需要，更大的数组可以适合L1D缓存。

我尝试将数组编译到程序中（如int offsets[] = { file contents with commas added };所以不必读取和解析它。它也使得大小成为编译时常量。这将运行时间减少到~36.2 + / - 0.1毫秒，低于~36.8，因此从文件读取的版本仍然将大部分时间花在实际问题上，而不是解析输入。（与Python不同，C ++的启动开销可以忽略不计，而我的Skylake CPU也是如此由于Skylake的硬件P状态管理，最高时钟速度可以达到毫秒级。）

如前所述，使用[rdi]而不是[rdi + rdx*4]这样的简单寻址模式进行指针追踪可以降低1c的延迟，并避免add（index += offset变为{{ 1}}）。英特尔，因为IvyBridge在寄存器之间具有零延迟整数current = target，因此不会延长关键路径。这是 the source (with comments) + asm for this hacky optimization 。典型的运行（文本解析为mov）： std::vector ，90.725 M个周期（3.900 GHz），23.26 +- 0.05 ms（3.18 IPC）。有趣的是，它实际上是更多的总指令，但运行得更快，因为循环携带的依赖链的延迟较低。

gcc使用分支，速度大约慢2倍。（在整个程序中，根据288.724 M instructions错误预测了14％的分支。它只是作为更新值的一部分进行分支，但是分支未命中使管道停止，因此它们也会影响关键路径，在某种程度上数据依赖性不在这里。这似乎是优化器的糟糕决定。）

重写条件为perf stat说服gcc使用asm看起来很好看。

gcc7.1.1 -O3 -march = skylake （对于使用offset[location] = (off>=3) ? off-1 : off+1;分支进行编译的原始来源。）

(off <= 3) ? : -1 : +1

<强> VS。 CPython（Arch Linux上的Python3.6.2）：

Performance counter stats for './ec-gcc':

     70.032162      task-clock (msec)         #    0.998 CPUs utilized          
             0      context-switches          #    0.000 K/sec                  
             0      cpu-migrations            #    0.000 K/sec                  
           118      page-faults               #    0.002 M/sec                  
   273,115,485      cycles                    #    3.900 GHz                    
   255,088,412      instructions              #    0.93  insn per cycle         
    44,382,466      branches                  #  633.744 M/sec                  
     6,230,137      branch-misses             #   14.04% of all branches        

   0.070181924 seconds time elapsed

我没有安装PyPy或其他Python实现，抱歉。

Answer 2

你可以改善小事，但是pypy（很可能）在这项任务中总是会更快。

对于CPython和Cython：

不要立刻读完整个文件。您可以随时迭代行，这可以节省（重新）分配成本。这要求您预先分配数组。
也许从列表切换到array。

对于Cython：

Mark the variable types。特别是作为int的计数器和作为int的数组的命令，以跳过大多数类型检查。

PyPy比Python快17倍。 Python可以加速吗？

2 个答案: