Question

The bottleneck of my code is currently the conversion from a Python list to a C array using ctypes, as described in this question.

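A small timeit experiment shows that it is indeed very slow compared to other Python operations. The three timings below are, in order, for the ctypes conversion, for building an array with the array module, and for building a set from the same list:

1.790962941000089
0.0911122129996329
0.3200237319997541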

I obtained these results with CPython 3.4.2. I get similar timings with CPython 2.7.9 and PyPy 2.4.0.
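The benchmark script itself is not reproduced in this copy of the question. A minimal sketch consistent with the test scripts in the answers below, assuming the same one-million-element list and ten repetitions, might look like this. The function name run_benchmark and its parameters are hypothetical, and the globals argument of timeit requires Python 3.5 or later.

```python
import ctypes
import timeit
from array import array


def run_benchmark(n=10**6, number=10):
    """Time three ways of consuming a list of n ints.

    Returns a dict mapping a label to the total seconds for `number` runs.
    """
    # Names made visible to the timed statements (hypothetical setup).
    env = {"ctypes": ctypes, "array": array, "t": list(range(n))}
    statements = {
        "ctypes": "(ctypes.c_uint32 * len(t))(*t)",  # list -> C array
        "array": 'array("I", t)',                    # list -> array.array
        "set": "set(t)",                             # list -> set, for scale
    }
    return {
        name: timeit.timeit(stmt=stmt, globals=env, number=number)
        for name, stmt in statements.items()
    }
```

Calling run_benchmark() with the defaults and printing the result should show the same ordering as the timings above, although the absolute numbers depend on the machine and interpreter.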

I tried running the above code with perf instead of timeit, commenting out two of the three statements so that only one runs at a time. I got these results:

ctypes:

 Performance counter stats for 'python3 perf.py':

       1807,891637      task-clock (msec)         #    1,000 CPUs utilized
                 8      context-switches          #    0,004 K/sec
                 0      cpu-migrations            #    0,000 K/sec
            59 523      page-faults               #    0,033 M/sec
     5 755 704 178      cycles                    #    3,184 GHz
    13 552 506 138      instructions              #    2,35  insn per cycle
     3 217 289 822      branches                  # 1779,581 M/sec
           748 614      branch-misses             #    0,02% of all branches

       1,808349671 seconds time elapsed

array:

 Performance counter stats for 'python3 perf.py':

        144,678718      task-clock (msec)         #    0,998 CPUs utilized
                 0      context-switches          #    0,000 K/sec
                 0      cpu-migrations            #    0,000 K/sec
            12 913      page-faults               #    0,089 M/sec
       458 284 661      cycles                    #    3,168 GHz
     1 253 747 066      instructions              #    2,74  insn per cycle
       325 528 639      branches                  # 2250,011 M/sec
           708 280      branch-misses             #    0,22% of all branches

       0,144966969 seconds time elapsed

set:

 Performance counter stats for 'python3 perf.py':

        369,786395      task-clock (msec)         #    0,999 CPUs utilized
                 0      context-switches          #    0,000 K/sec
                 0      cpu-migrations            #    0,000 K/sec
           108 584      page-faults               #    0,294 M/sec
     1 175 946 161      cycles                    #    3,180 GHz
     2 086 554 968      instructions              #    1,77  insn per cycle
       422 531 402      branches                  # 1142,636 M/sec
           768 338      branch-misses             #    0,18% of all branches

       0,370103043 seconds time elapsed

The code with ctypes has fewer page faults than the code with set, and about the same number of branch misses as the other two. The only differences I can see are that it executes more instructions and branches (but I still don't know why) and has more context switches (but that is surely a consequence of the longer run time rather than a cause).

I therefore have two questions:

Why is ctypes so slow?
Is there a way to improve performance, either with ctypes or with another library?

Answer 1

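While this is not a definitive answer, the problem appears to be the constructor call with *t. Doing the following instead significantly reduces the overhead: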

array =  (ctypes.c_uint32 * len(t))()
array[:] = t

Test:

import timeit
setup="from array import array; import ctypes; t = [i for i in range(1000000)];"
print(timeit.timeit(stmt='(ctypes.c_uint32 * len(t))(*t)',setup=setup,number=10))
print(timeit.timeit(stmt='a = (ctypes.c_uint32 * len(t))(); a[:] = t',setup=setup,number=10))
print(timeit.timeit(stmt='array("I",t)',setup=setup,number=10))
print(timeit.timeit(stmt='set(t)',setup=setup,number=10))

Output:

1.7090932869978133
0.3084979929990368
0.08278547400186653
0.2775516299989249
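As an illustration of the slice-assignment approach from this answer, here is a minimal sketch that wraps it in a helper and checks that it yields the same contents as the constructor call with *t. The helper name list_to_c_array and its ctype parameter are hypothetical, not part of ctypes. With *t, every list element is passed as a separate positional argument to the constructor, whereas the slice assignment hands the whole list over in one operation.

```python
import ctypes


def list_to_c_array(values, ctype=ctypes.c_uint32):
    """Copy a Python sequence into a new ctypes array via slice assignment."""
    c_array = (ctype * len(values))()  # allocate a zero-initialised array
    c_array[:] = values                # fill it in one slice assignment
    return c_array


t = list(range(1000))
fast = list_to_c_array(t)
slow = (ctypes.c_uint32 * len(t))(*t)  # the slower construction from the question

# Both constructions hold the same data.
assert list(fast) == list(slow) == t
```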

Answer 2

The solution is to use the array module and either cast the buffer address or use the from_buffer method...

import timeit
setup="from array import array; import ctypes; t = [i for i in range(1000000)];"
print(timeit.timeit(stmt="v = array('I',t);assert v.itemsize == 4; addr, count = v.buffer_info();p = ctypes.cast(addr,ctypes.POINTER(ctypes.c_uint32))",setup=setup,number=10))
print(timeit.timeit(stmt="v = array('I',t);a = (ctypes.c_uint32 * len(v)).from_buffer(v)",setup=setup,number=10))
print(timeit.timeit(stmt='(ctypes.c_uint32 * len(t))(*t)',setup=setup,number=10))
print(timeit.timeit(stmt='set(t)',setup=setup,number=10))

With Python 3, this is many times faster:

$ python3 convert.py
0.08303386811167002
0.08139665238559246
1.5630637975409627
0.3013848252594471
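The from_buffer variant can be sketched as a small helper. This assumes a platform where the array type code "I" is 4 bytes wide, which the code checks rather than takes for granted. The helper name list_to_shared_c_array is hypothetical. The sketch also shows that the ctypes array is a view on the memory of the array.array object, not a second copy.

```python
import ctypes
from array import array


def list_to_shared_c_array(values):
    """Return (backing, view): an array.array and a ctypes array sharing its memory."""
    backing = array("I", values)
    # The width of "I" is platform-dependent, so verify it matches c_uint32.
    if backing.itemsize != ctypes.sizeof(ctypes.c_uint32):
        raise ValueError("array('I') items are not 4 bytes on this platform")
    view = (ctypes.c_uint32 * len(backing)).from_buffer(backing)
    return backing, view


backing, view = list_to_shared_c_array([10, 20, 30])
view[0] = 99             # write through the ctypes view...
assert backing[0] == 99  # ...and the array.array sees the change
# The view starts at the same address as the array's buffer.
assert ctypes.addressof(view) == backing.buffer_info()[0]
```

With the ctypes.cast variant, the resulting pointer is only a raw address, so the array object needs to stay referenced for as long as the pointer is in use.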
