I am testing different versions of string sanitization and ran into the effect below. Given IPython's %timeit warning, I find it hard to tell whether the result really is being cached or whether the timing is genuine. Please advise:
str.replace:

def sanit2(s):
    for c in ["'", '%', '"']:
        s = s.replace(c, '')
    return s
In [44]: %timeit sanit2(r""" ' ' % a % ' """)
The slowest run took 12.43 times longer than the fastest. This could mean that an intermediate result is being cached
1000000 loops, best of 3: 985 ns per loop
List comprehension:

def sanit3(s):
    removed = [x for x in s if x not in ["'", '%', '"']]
    return ''.join(removed)
In [42]: %timeit sanit3(r""" ' ' % a % ' """)
The slowest run took 8.95 times longer than the fastest. This could mean that an intermediate result is being cached
100000 loops, best of 3: 2.12 µs per loop
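As a quick sanity check (my addition, restating the two definitions above), both versions produce identical output, so the timing comparison is apples to apples:

```python
def sanit2(s):
    # Same as above: one C-level replace() pass per unwanted character.
    for c in ["'", '%', '"']:
        s = s.replace(c, '')
    return s

def sanit3(s):
    # Same as above: per-character membership test in a Python-level loop.
    removed = [x for x in s if x not in ["'", '%', '"']]
    return ''.join(removed)

sample = r""" ' ' % a % ' """
assert sanit2(sample) == sanit3(sample)
```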
This also seems to hold for relatively long strings:
In [46]: reallylong = r""" ' ' % a % ' """ * 1000
In [47]: len(reallylong)
Out[47]: 22000
In [48]: %timeit sanit2(reallylong)
The slowest run took 4.94 times longer than the fastest. This could mean that an intermediate result is being cached
10000 loops, best of 3: 96.9 µs per loop
In [49]: %timeit sanit3(reallylong)
1000 loops, best of 3: 1.9 ms per loop
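On the warning itself: %timeit's message only means the slowest of its runs was much slower than the fastest, which is usually warm-up, garbage collection, or scheduler noise rather than result caching (CPython does not memoize these calls). A sketch with the stdlib timeit module exposes the spread directly (timing numbers will of course vary by machine):

```python
import timeit

def sanit2(s):
    # Same definition as above.
    for c in ["'", '%', '"']:
        s = s.replace(c, '')
    return s

data = r""" ' ' % a % ' """ * 1000

# repeat() returns one total time per repetition; a large gap between
# min and max is exactly what triggers IPython's caching warning.
times = timeit.repeat(lambda: sanit2(data), repeat=5, number=100)
print('per-call min/max (s):', min(times) / 100, max(times) / 100)
```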
Update: I think str.replace also has more or less O(n) complexity, so I would expect both sanit2 and sanit3 to have roughly O(n^2) complexity. I tested the cost of str.replace as a function of string length:
In [59]: orig_str = r""" ' ' % a % ' """
In [60]: for i in range(1,11):
....: longer = orig_str * i * 1000
....: %timeit longer.replace('%', '')
....:
10000 loops, best of 3: 44.2 µs per loop
10000 loops, best of 3: 87.8 µs per loop
10000 loops, best of 3: 131 µs per loop
10000 loops, best of 3: 177 µs per loop
1000 loops, best of 3: 219 µs per loop
1000 loops, best of 3: 259 µs per loop
1000 loops, best of 3: 311 µs per loop
1000 loops, best of 3: 349 µs per loop
1000 loops, best of 3: 398 µs per loop
1000 loops, best of 3: 435 µs per loop
In [61]: t="""10000 loops, best of 3: 44.2 µs per loop
....: 10000 loops, best of 3: 87.8 µs per loop
....: 10000 loops, best of 3: 131 µs per loop
....: 10000 loops, best of 3: 177 µs per loop
....: 1000 loops, best of 3: 219 µs per loop
....: 1000 loops, best of 3: 259 µs per loop
....: 1000 loops, best of 3: 311 µs per loop
....: 1000 loops, best of 3: 349 µs per loop
....: 1000 loops, best of 3: 398 µs per loop
....: 1000 loops, best of 3: 435 µs per loop"""
It looks quite linear, but I wanted to make sure:
In [63]: averages=[]
In [66]: for idx, line in enumerate(t.split('\n')):
....: repl_time = line.rsplit(':',1)[1].split(' ')[1]
....: averages.append(float(repl_time)/(idx+1))
....:
In [67]: averages
Out[67]:
[44.2,
43.9,
43.666666666666664,
44.25,
43.8,
43.166666666666664,
44.42857142857143,
43.625,
44.22222222222222,
43.5]
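Parsing %timeit's printed output is fragile; the same per-unit averages can be computed directly with timeit (my sketch, same shape of experiment as above):

```python
import timeit

ORIG = r""" ' ' % a % ' """

def replace_cost(n_copies):
    # Best of 3 repetitions, 100 calls each, normalized to one call.
    s = ORIG * n_copies * 1000
    return min(timeit.repeat(lambda: s.replace('%', ''),
                             repeat=3, number=100)) / 100

# For an O(n) str.replace the per-copy cost should stay roughly flat.
per_copy = [replace_cost(i) / i for i in range(1, 6)]
print(per_copy)
```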
Yes, str.replace is almost exactly O(n). So, on top of iterating over the list of characters to replace, sanit2 should have O(n^2) complexity, just like sanit3 (x for x in s => iterating over the characters of the string being cleaned, O(n); ... x in ["'", '%', '"'] should be O(n) as well, the cost of list.__contains__; O(n^2) in total).

So, in reply to chepner: yes, sanit2 performs a fixed number of function calls (very few, just 3 in the example), but because of the internal cost of str.replace it seems that sanit2 should have a similar order of complexity to sanit3.

Is the difference down to str.replace being implemented in C, or do the function calls (list.__contains__) also play an important role?
Answer 0 (score: 0):
sanit2 makes a fixed number of calls to string methods that are implemented in C, regardless of the length of s.

sanit3 makes a variable number of calls to list.__contains__ (one per character of s), and list.__contains__ itself uses an O(n), not O(1), algorithm. It also has to construct a list object, then call ''.join on that list.

It's not surprising that sanit2 is faster.
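A natural follow-up to the answer (my addition, not from the thread): since the per-character list membership test is the Python-level cost in sanit3, swapping the 3-element list for a frozenset makes each test O(1) on average. The win here is only a constant factor, because the list has just three elements:

```python
BAD = frozenset(("'", '%', '"'))  # O(1) average membership, vs O(m) for a list

def sanit3_set(s):
    # Hypothetical variant of sanit3 using set membership.
    return ''.join(x for x in s if x not in BAD)

print(sanit3_set(r""" ' ' % a % ' """))
```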