Is a list comprehension really that much slower than str.replace?

Date: 2016-02-24 17:55:17

Tags: python performance list-comprehension

I was benchmarking different versions of a string-sanitizing function and ran into the effect below. I find it hard to tell whether this is really a caching artifact, as IPython's %timeit warning suggests, or a genuine difference. Please advise:

str.replace:

def sanit2(s):
    for c in ["'", '%', '"']:
        s = s.replace(c, '')
    return s


In [44]: %timeit sanit2(r"""   '   '    % a % '   """)
The slowest run took 12.43 times longer than the fastest. This could mean that an intermediate result is being cached
1000000 loops, best of 3: 985 ns per loop    

List comprehension:

def sanit3(s):
    removed = [x for x in s if x not in ["'", '%', '"']]
    return ''.join(removed)


In [42]: %timeit sanit3(r"""   '   '    % a % '   """)
The slowest run took 8.95 times longer than the fastest. This could mean that an intermediate result is being cached
100000 loops, best of 3: 2.12 µs per loop        
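As a side note (not part of the original question): part of sanit3's per-character cost is the membership test against a 3-element list. A variant using a set, whose membership test is O(1) on average, is a common tweak. The name sanit3_set below is illustrative, not from the original post:

```python
def sanit3_set(s):
    # A set gives O(1) average-case membership tests,
    # versus a scan of the 3-element list in sanit3.
    bad = {"'", '%', '"'}
    return ''.join(x for x in s if x not in bad)

print(sanit3_set("a%b'c\"d"))
```

This changes only the constant factor of the per-character work; the comprehension still visits every character in Python bytecode.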

This seems to hold for relatively long strings as well:

In [46]: reallylong = r"""   '   '    % a % '   """ * 1000

In [47]: len(reallylong)
Out[47]: 22000


In [48]: %timeit sanit2(reallylong)
The slowest run took 4.94 times longer than the fastest. This could mean that an intermediate result is being cached
10000 loops, best of 3: 96.9 µs per loop


In [49]: %timeit sanit3(reallylong)
1000 loops, best of 3: 1.9 ms per loop        

Update: I believe str.replace also has more or less O(n) complexity, so I would expect both sanit2 and sanit3 to have roughly O(n^2) complexity.

I measured the cost of str.replace as a function of string length:

In [59]: orig_str = r"""   '   '    % a % '   """


In [60]: for i in range(1,11):
   ....:     longer = orig_str * i * 1000
   ....:     %timeit longer.replace('%', '')
   ....:
10000 loops, best of 3: 44.2 µs per loop
10000 loops, best of 3: 87.8 µs per loop
10000 loops, best of 3: 131 µs per loop
10000 loops, best of 3: 177 µs per loop
1000 loops, best of 3: 219 µs per loop
1000 loops, best of 3: 259 µs per loop
1000 loops, best of 3: 311 µs per loop
1000 loops, best of 3: 349 µs per loop
1000 loops, best of 3: 398 µs per loop
1000 loops, best of 3: 435 µs per loop


In [61]: t="""10000 loops, best of 3: 44.2 µs per loop
   ....: 10000 loops, best of 3: 87.8 µs per loop
   ....: 10000 loops, best of 3: 131 µs per loop
   ....: 10000 loops, best of 3: 177 µs per loop
   ....: 1000 loops, best of 3: 219 µs per loop
   ....: 1000 loops, best of 3: 259 µs per loop
   ....: 1000 loops, best of 3: 311 µs per loop
   ....: 1000 loops, best of 3: 349 µs per loop
   ....: 1000 loops, best of 3: 398 µs per loop
   ....: 1000 loops, best of 3: 435 µs per loop"""

That looks fairly linear, but I wanted to make sure:

In [63]: averages=[]   


In [66]: for idx, line in enumerate(t.split('\n')):
   ....:     repl_time = line.rsplit(':',1)[1].split(' ')[1]
   ....:     averages.append(float(repl_time)/(idx+1))
   ....:

In [67]: averages
Out[67]:
[44.2,
 43.9,
 43.666666666666664,
 44.25,
 43.8,
 43.166666666666664,
 44.42857142857143,
 43.625,
 44.22222222222222,
 43.5]

So yes, str.replace is almost exactly O(n). Therefore, on top of iterating over the list of characters to replace, sanit2 should have O(n^2) complexity, just like sanit3 (`x for x in s` => iterating over the characters of the string being cleaned, O(n); `x in ["'", '%', '"']` should be O(n) as well, the cost of list.__contains__; O(n^2) in total).
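The same linearity check done above for str.replace can be applied to sanit3 itself. This is a sketch (not from the original post) using the timeit module directly instead of the %timeit magic; if sanit3 is O(n), the per-character time should stay roughly constant as the string grows:

```python
import timeit

def sanit3(s):
    removed = [x for x in s if x not in ["'", '%', '"']]
    return ''.join(removed)

orig = r"""   '   '    % a % '   """
for i in (1, 2, 4, 8):
    s = orig * i * 1000
    # number=20 keeps the total runtime modest for the longest string.
    t = timeit.timeit(lambda: sanit3(s), number=20)
    print(f"len={len(s):6d}  time per char: {t / len(s):.3e} s")
```

A roughly constant per-character figure would indicate sanit3 is O(n) in the string length for a fixed set of characters to remove, with the list membership check contributing only a constant factor.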

So, in reply to chepner: yes, sanit2 performs a fixed number of function calls (very few, just 3 in the example), but given the internal cost of str.replace, it seems sanit2 should have a similar order of complexity to sanit3.

Is the difference due to str.replace being implemented in C, or do the function calls (list.__contains__) also play a significant role?

1 answer:

Answer 0 (score: 0):

sanit2 makes a fixed number of calls to a string method implemented in C, regardless of the length of s.

sanit3 makes a variable number of calls to list.__contains__ (one per element of s), and list membership itself uses an O(n) algorithm, not an O(1) one. It also has to build a list, then call ''.join on that list.

It's not surprising that sanit2 is faster.
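As an aside beyond the original answer: the standard-library way to delete a fixed set of characters in a single C-level pass is str.translate with a table built by str.maketrans. A sketch (the name sanit_translate is illustrative):

```python
def sanit_translate(s):
    # The third argument to str.maketrans lists characters to delete;
    # translate then removes them in one pass implemented in C.
    table = str.maketrans('', '', "'%\"")
    return s.translate(table)

print(sanit_translate("a%b'c\"d"))
```

Unlike the chained replaces in sanit2, this scans the string only once no matter how many characters are being removed, which tends to matter as the removal set grows.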