Question

I have a process that loops through two lists, one relatively large and the other significantly larger.

Example:

larger_list = list(range(15000))
smaller_list = list(range(2500))

for ll in larger_list:
    for sl in smaller_list:            
        pass

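I scaled the size of the lists down to test performance, and I noticed a decent difference depending on which list is looped through first.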

import timeit

larger_list = list(range(150))
smaller_list = list(range(25))


def large_then_small():
    for ll in larger_list:
        for sl in smaller_list:
            pass


def small_then_large():
    for sl in smaller_list:
        for ll in larger_list:
            pass


print('Larger -> Smaller: {}'.format(timeit.timeit(large_then_small)))
print('Smaller -> Larger: {}'.format(timeit.timeit(small_then_large)))

>>> Larger -> Smaller: 114.884992572
>>> Smaller -> Larger: 98.7751009799

At first glance the two functions look identical, yet there is a 16-second difference between them.

Why is that?

Answer 1

When you disassemble one of your functions you get:

>>> import dis
>>> dis.dis(small_then_large)
  2           0 SETUP_LOOP              31 (to 34)
              3 LOAD_GLOBAL              0 (smaller_list)
              6 GET_ITER
        >>    7 FOR_ITER                23 (to 33)
             10 STORE_FAST               0 (sl)

  3          13 SETUP_LOOP              14 (to 30)
             16 LOAD_GLOBAL              1 (larger_list)
             19 GET_ITER
        >>   20 FOR_ITER                 6 (to 29)
             23 STORE_FAST               1 (ll)

  4          26 JUMP_ABSOLUTE           20
        >>   29 POP_BLOCK
        >>   30 JUMP_ABSOLUTE            7
        >>   33 POP_BLOCK
        >>   34 LOAD_CONST               0 (None)
             37 RETURN_VALUE
>>>
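
For anyone who wants to reproduce the listing, here is a minimal sketch using the standard dis module. The exact opcodes depend on the CPython version: the 3-byte offsets above suggest an older interpreter, and as far as I know SETUP_LOOP was dropped in CPython 3.8, so the second list printed below may well be empty on a recent Python. The variable names opnames and block_ops are mine, for illustration only.

```python
import dis

larger_list = list(range(150))
smaller_list = list(range(25))


def small_then_large():
    for sl in smaller_list:
        for ll in larger_list:
            pass


# Opcode names in execution-listing order; the exact set depends on the CPython version.
opnames = [instr.opname for instr in dis.get_instructions(small_then_large)]
print(opnames)

# Block-related opcodes, if this interpreter still emits them for loops.
block_ops = [name for name in opnames if name in ('SETUP_LOOP', 'POP_BLOCK')]
print(block_ops)
```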

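Looking at addresses 29 & 30, it looks like these will execute every time the inner loop ends. The two loops look basically the same, but these two instructions are executed each time the inner loop exits. Having the smaller list on the inner loop means the inner loop exits more often, so these instructions run more often and the total time goes up (compared with having the larger list on the inner loop).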
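
To make the counting argument concrete, here is a small sketch (the helper count_loop_events is a made-up name, not something from the question) that tallies how often the inner loop is entered and how often its body runs for both orderings of the 150- and 25-element lists. The body runs the same number of times either way; only the number of inner-loop entries and exits changes.

```python
def count_loop_events(outer, inner):
    """Return (inner_loop_entries, body_iterations) for a nested loop."""
    inner_loop_entries = 0
    body_iterations = 0
    for _ in outer:
        inner_loop_entries += 1  # the inner loop is set up and exited once per outer item
        for _ in inner:
            body_iterations += 1
    return inner_loop_entries, body_iterations


larger_list = list(range(150))
smaller_list = list(range(25))

events_large_outer = count_loop_events(larger_list, smaller_list)
events_small_outer = count_loop_events(smaller_list, larger_list)

print(events_large_outer)  # (150, 3750)
print(events_small_outer)  # (25, 3750)
```

So with the larger list on the outside, the per-exit instructions run 150 times per call instead of 25.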

Answer 2

The same phenomenon was under discussion in this duplicate, which got me interested in what goes on in the C land of CPython. I built Python with:

% ./configure --enable-profiling
% make coverage

Tested with:

% ./python -c "larger_list = list(range(15000))
smaller_list = list(range(2500))
for sl in smaller_list:
    for ll in larger_list: 
        pass"
% mv gmon.out soflgmon.out

% ./python -c "larger_list = list(range(15000))
smaller_list = list(range(2500))
for ll in larger_list:
    for sl in smaller_list:
        pass"
% mv gmon.out lofsgmon.out

Results

Short list of long lists, i.e. the smaller list in the outer loop (total time for a single run 1.60):

% gprof python soflgmon.out|head -n40
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 46.25      0.74     0.74     3346     0.00     0.00  PyEval_EvalFrameEx
 25.62      1.15     0.41 37518735     0.00     0.00  insertdict
 14.38      1.38     0.23 37555121     0.00     0.00  lookdict_unicode_nodummy
  7.81      1.50     0.12 37506675     0.00     0.00  listiter_next
  4.06      1.57     0.07 37516233     0.00     0.00  PyDict_SetItem
  0.62      1.58     0.01     2095     0.00     0.00  _PyEval_EvalCodeWithName
  0.62      1.59     0.01        3     0.00     0.00  untrack_dicts
  0.31      1.59     0.01                             _PyDict_SetItem_KnownHash
  0.31      1.60     0.01                             listiter_len
  0.00      1.60     0.00    87268     0.00     0.00  visit_decref
  0.00      1.60     0.00    73592     0.00     0.00  visit_reachable
  0.00      1.60     0.00    71261     0.00     0.00  _PyThreadState_UncheckedGet
  0.00      1.60     0.00    49742     0.00     0.00  _PyObject_Alloc
  0.00      1.60     0.00    48922     0.00     0.00  PyObject_Malloc
  0.00      1.60     0.00    48922     0.00     0.00  _PyObject_Malloc
  0.00      1.60     0.00    47487     0.00     0.00  PyDict_GetItem
  0.00      1.60     0.00    44246     0.00     0.00  _PyObject_Free
  0.00      1.60     0.00    43637     0.00     0.00  PyObject_Free
  0.00      1.60     0.00    30034     0.00     0.00  slotptr
  0.00      1.60     0.00    24892     0.00     0.00  type_is_gc
  0.00      1.60     0.00    24170     0.00     0.00  r_byte
  0.00      1.60     0.00    23774     0.00     0.00  PyErr_Occurred
  0.00      1.60     0.00    20371     0.00     0.00  _PyType_Lookup
  0.00      1.60     0.00    19930     0.00     0.00  PyLong_FromLong
  0.00      1.60     0.00    19758     0.00     0.00  r_string
  0.00      1.60     0.00    19080     0.00     0.00  _PyLong_New
  0.00      1.60     0.00    18887     0.00     0.00  lookdict_unicode
  0.00      1.60     0.00    18878     0.00     0.00  long_dealloc
  0.00      1.60     0.00    17639     0.00     0.00  PyUnicode_InternInPlace
  0.00      1.60     0.00    17502     0.00     0.00  rangeiter_next
  0.00      1.60     0.00    14776     0.00     0.00  PyObject_GC_UnTrack
  0.00      1.60     0.00    14578     0.00     0.00  descr_traverse
  0.00      1.60     0.00    13520     0.00     0.00  r_long
  0.00      1.60     0.00    13058     0.00     0.00  PyUnicode_New
  0.00      1.60     0.00    12298     0.00     0.00  _Py_CheckFunctionResult
  ...

Long list of short lists, i.e. the larger list in the outer loop (total time for a single run 1.64):

% gprof python lofsgmon.out|head -n40
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 48.78      0.80     0.80     3346     0.00     0.00  PyEval_EvalFrameEx
 17.99      1.09     0.29 37531168     0.00     0.00  insertdict
 11.59      1.28     0.19 37531675     0.00     0.00  listiter_next
 11.28      1.47     0.18 37580156     0.00     0.00  lookdict_unicode_nodummy
  6.71      1.58     0.11 37528666     0.00     0.00  PyDict_SetItem
  1.22      1.60     0.02                             _PyDict_SetItem_KnownHash
  0.61      1.61     0.01     5525     0.00     0.00  update_one_slot
  0.61      1.62     0.01      120     0.00     0.00  PyDict_Merge
  0.30      1.62     0.01    18178     0.00     0.00  lookdict_unicode
  0.30      1.63     0.01    11988     0.00     0.00  insertdict_clean
  0.30      1.64     0.01                             listiter_len
  0.30      1.64     0.01                             listiter_traverse
  0.00      1.64     0.00    96089     0.00     0.00  _PyThreadState_UncheckedGet
  0.00      1.64     0.00    87245     0.00     0.00  visit_decref
  0.00      1.64     0.00    74743     0.00     0.00  visit_reachable
  0.00      1.64     0.00    62232     0.00     0.00  _PyObject_Alloc
  0.00      1.64     0.00    61412     0.00     0.00  PyObject_Malloc
  0.00      1.64     0.00    61412     0.00     0.00  _PyObject_Malloc
  0.00      1.64     0.00    59815     0.00     0.00  PyDict_GetItem
  0.00      1.64     0.00    55231     0.00     0.00  _PyObject_Free
  0.00      1.64     0.00    54622     0.00     0.00  PyObject_Free
  0.00      1.64     0.00    36274     0.00     0.00  PyErr_Occurred
  0.00      1.64     0.00    30034     0.00     0.00  slotptr
  0.00      1.64     0.00    24929     0.00     0.00  type_is_gc
  0.00      1.64     0.00    24617     0.00     0.00  _PyObject_GC_Alloc
  0.00      1.64     0.00    24617     0.00     0.00  _PyObject_GC_Malloc
  0.00      1.64     0.00    24170     0.00     0.00  r_byte
  0.00      1.64     0.00    20958     0.00     0.00  PyObject_GC_Del
  0.00      1.64     0.00    20371     0.00     0.00  _PyType_Lookup
  0.00      1.64     0.00    19918     0.00     0.00  PyLong_FromLong
  0.00      1.64     0.00    19758     0.00     0.00  r_string
  0.00      1.64     0.00    19068     0.00     0.00  _PyLong_New
  0.00      1.64     0.00    18845     0.00     0.00  long_dealloc
  0.00      1.64     0.00    18507     0.00     0.00  _PyObject_GC_New
  0.00      1.64     0.00    17639     0.00     0.00  PyUnicode_InternInPlace
  ...

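The difference is marginal (2.4%), and profiling adds to the run time, so it is difficult to say how large it would actually have been. The total time also includes creating the test lists, which hides the true difference further.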
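
One way to keep list creation out of the measurement is to build the lists first and time only the loops. This is a rough sketch, not what was profiled above: the helper time_nested_loop is a hypothetical name, the sizes are scaled down by a factor of 10 so it finishes quickly, and because the loops sit inside a function the loop variables are fast locals rather than module-level names, so the absolute numbers will differ from the gprof runs.

```python
import time


def time_nested_loop(outer, inner):
    """Time only the nested loop; the caller builds the lists beforehand."""
    start = time.perf_counter()
    for _ in outer:
        for _ in inner:
            pass
    return time.perf_counter() - start


# Scaled down from 15000 and 2500 (an assumption made to keep the sketch quick).
larger_list = list(range(1500))
smaller_list = list(range(250))

elapsed_large_outer = time_nested_loop(larger_list, smaller_list)
elapsed_small_outer = time_nested_loop(smaller_list, larger_list)

print('Larger -> Smaller: {:.4f} s'.format(elapsed_large_outer))
print('Smaller -> Larger: {:.4f} s'.format(elapsed_small_outer))
```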

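The reason for the 16 s difference in your original test is that timeit.timeit runs the given statement or function number=1000000 times by default, so the 0.04 s gap measured here would add up to a whopping 40,000 s in that case. Don't quote that value though, as it is an artifact of profiling. With your original test code and a non-profiling python3 on this machine I get: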

Larger -> Smaller: 40.29234626500056
Smaller -> Larger: 33.09413992699956

which would mean a difference of

In [1]: (40.29234626500056-33.09413992699956)/1000000
Out[1]: 7.198206338001e-06

per run (7.2 µs), about 18% of the slower total.
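
If you would rather see per-call figures directly than divide by a million afterwards, you can pass number explicitly and divide yourself. A minimal sketch, assuming the same two functions as in the question; per_call_seconds and the chosen number/repeat values are my own picks for illustration, not anything timeit prescribes.

```python
import timeit

larger_list = list(range(150))
smaller_list = list(range(25))


def large_then_small():
    for ll in larger_list:
        for sl in smaller_list:
            pass


def small_then_large():
    for sl in smaller_list:
        for ll in larger_list:
            pass


def per_call_seconds(func, number=2000, repeat=3):
    """Best-of-`repeat` average time of one call to func, in seconds."""
    # timeit.repeat returns one total per repetition; take the least noisy one.
    totals = timeit.repeat(func, number=number, repeat=repeat)
    return min(totals) / number


t_large_outer = per_call_seconds(large_then_small)
t_small_outer = per_call_seconds(small_then_large)

print('Larger -> Smaller: {:.2f} us per call'.format(t_large_outer * 1e6))
print('Smaller -> Larger: {:.2f} us per call'.format(t_small_outer * 1e6))
```

Taking the minimum of several repetitions tends to filter out noise from other processes on the machine.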

As stated in the former answer, POP_BLOCK gets executed more often, but it is not only that: the whole inner loop setup is repeated as well:

  0.00      1.64     0.00    16521     0.00     0.00  PyFrame_BlockSetup
  0.00      1.64     0.00    16154     0.00     0.00  PyFrame_BlockPop

compared to the short list of long lists:

  0.00      1.60     0.00     4021     0.00     0.00  PyFrame_BlockSetup
  0.00      1.60     0.00     3748     0.00     0.00  set_next
  0.00      1.60     0.00     3654     0.00     0.00  PyFrame_BlockPop

That has a negligible impact, though.
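
As a quick sanity check on those call counts: the inner loop is set up once per outer iteration, so the two orderings should differ by 15000 - 2500 block setups. The sketch below only does arithmetic on numbers copied from the gprof listings above; the variable names are mine.

```python
outer_large, outer_small = 15000, 2500

# One inner-loop setup per outer iteration, so this is the expected gap.
expected_extra_setups = outer_large - outer_small

# Call counts copied from the two gprof listings above.
block_setup_long_outer, block_setup_short_outer = 16521, 4021
block_pop_long_outer, block_pop_short_outer = 16154, 3654

setup_gap = block_setup_long_outer - block_setup_short_outer
pop_gap = block_pop_long_outer - block_pop_short_outer

print(expected_extra_setups)  # 12500
print(setup_gap)              # 12500
print(pop_gap)                # 12500

# Extra setups relative to the total number of inner-loop iterations.
total_iterations = outer_large * outer_small
print(expected_extra_setups / total_iterations)
```

Both gaps come out at exactly 12500, and that is tiny next to the 37.5 million iterations of the loop body, which fits the negligible impact seen in the profile.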

Why is there a performance difference between the order of a nested loop?
