In Python 3, the re module can be used with memoryview objects:
~$ python3
Python 3.2.3 (default, Feb 20 2013, 14:44:27)
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> x = b"abc"
>>> import re
>>> re.search(b"b", memoryview(x))
<_sre.SRE_Match object at 0x7f14b5fb8988>
In Python 2, however, this does not seem to be the case:
~$ python
Python 2.7.3 (default, Mar 13 2014, 11:03:55)
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> x = "abc"
>>> import re
>>> re.search(b"b", memoryview(x))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/re.py", line 142, in search
    return _compile(pattern, flags).search(string)
TypeError: expected string or buffer
I could convert the string to a buffer, but the buffer documentation does not spell out exactly how buffer compares with memoryview.
An empirical comparison shows that using a buffer object in Python 2 does not give the performance benefit that memoryview gives in Python 3:
playground$ cat speed-test.py
import timeit
import sys
print(timeit.timeit("regex.search(mv[10:])", setup='''
import re
regex = re.compile(b"ABC")
PYTHON_3 = sys.version_info >= (3, )
if PYTHON_3:
    mv = memoryview(b"Can you count to three or sing 'ABC?'" * 1024)
else:
    mv = buffer(b"Can you count to three or sing 'ABC?'" * 1024)
'''))
playground$ python2.7 speed-test.py
2.33041596413
playground$ python2.7 speed-test.py
2.3322429657
playground$ python3.2 speed-test.py
0.381270170211792
playground$ python3.2 speed-test.py
0.3775448799133301
playground$
If the regex.search argument is changed from mv[10:] to mv, Python 2 performs about the same as Python 3, but the code I am writing does a lot of repeated string slicing.
Is there a way to work around this in Python 2 while still getting the zero-copy performance benefit of memoryview?
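To make "zero-copy" concrete, here is a minimal Python 3 sketch (not part of the original measurements) showing that a memoryview slice shares memory with the underlying object, whereas a bytes slice is an independent copy. The variable names are illustrative only.

```python
import re

data = bytearray(b"Can you count to three or sing 'ABC?'")
view = memoryview(data)[10:]   # a view onto data; no bytes are copied
copy = bytes(data)[10:]        # an independent copy of the tail

# A change to the underlying bytearray is visible through the view,
# but not through the copy that was taken earlier.
data[10] = ord("U")
view_first = view[0]
copy_first = copy[0]

# In Python 3 the re module accepts the view directly.
match = re.search(b"ABC", view)
print(view_first, copy_first, match.start())
```

The slice taken through the view costs the same regardless of how large the data is, which is why repeated slicing stays cheap with memoryview.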
Answer 0 (score: 2)
As I understand the buffer object in Python 2, you are meant to use it without slicing:
>>> s = b"Can you count to three or sing 'ABC?'"
>>> str(buffer(s, 10))
"unt to three or sing 'ABC?'"
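So instead of slicing the resulting buffer, use the buffer function itself to do the slicing, which gives fast access to the substring you are interested in: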
import timeit
import sys
import re
r = re.compile(b'ABC')
s = b"Can you count to three or sing 'ABC?'" * 1024
PYTHON_3 = sys.version_info >= (3, )
if len(sys.argv) > 1: # standard slicing
    print(timeit.timeit("r.search(s[10:])", setup='from __main__ import r, s'))
elif PYTHON_3: # memoryview in Python 3
    print(timeit.timeit("r.search(s[10:])", setup='from __main__ import r, s; s = memoryview(s)'))
else: # buffer in Python 2
    print(timeit.timeit("r.search(buffer(s, 10))", setup='from __main__ import r, s'))
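I get very similar results in Python 2 and 3, which suggests that using buffer with the re module in this way has much the same effect as the newer memoryview (which then seems to be a lazily evaluated buffer):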
$ python2 .\speed-test.py
0.681979371561
$ python3 .\speed-test.py
0.5693422508853488
Compare that with standard string slicing:
$ python2 .\speed-test.py standard-slicing
7.92006735956
$ python3 .\speed-test.py standard-slicing
7.817641705304309
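As an aside, and as a different technique from the buffer approach timed above: if the only reason for slicing is to start the search at an offset, compiled patterns accept a start position directly via search(string, pos), in both Python 2 and Python 3. The sketch below was not part of the timings above. Two caveats apply: match offsets are reported relative to the full string, and '^' does not necessarily match at pos the way it would at the start of a slice.

```python
import re

regex = re.compile(b"ABC")
data = b"Can you count to three or sing 'ABC?'" * 1024

# Start scanning at offset 10 without creating a slice of data.
match = regex.search(data, 10)

# For comparison: the same search on a sliced copy.
sliced = regex.search(data[10:])

# Offsets from search(data, pos) refer to the full string,
# so they differ from the sliced result by exactly pos.
print(match.start(), sliced.start())
```

Whether this fits depends on whether the rest of the code needs the sliced object itself or only the match.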
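If you want to support slice access (so that you can use the same syntax everywhere), you can easily create a type that builds a new buffer on the fly whenever it is sliced: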
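class slicingbuffer:
    # Python 2 only: slicing returns a buffer over the source instead of a copy.
    # Negative indices and slice steps are not handled.
    def __init__ (self, source):
        self.source = source
    def __getitem__ (self, index):
        if not isinstance(index, slice):
            return buffer(self.source, index, 1)
        start = index.start or 0 # a slice like [:n] may pass start=None
        if index.stop is None:
            return buffer(self.source, start)
        else:
            size = max(index.stop - start, 0)
            return buffer(self.source, start, size)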
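If you only use it with the re module, it can probably serve as a drop-in replacement for memoryview. However, my tests show that this already adds a lot of overhead. So you may want to do the opposite and wrap Python 3's memoryview object in a wrapper that offers the same interface as buffer: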
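def memoryviewbuffer (source, start, size = None):
    # Mirrors buffer(source, offset[, size]): the third argument is a size,
    # and leaving it out means "up to the end of the data".
    if size is None:
        return source[start:]
    return source[start:start + size]
PYTHON_3 = sys.version_info >= (3, )
if PYTHON_3:
    b = memoryviewbuffer
    s = memoryview(s)
else:
    b = buffer
print(timeit.timeit("r.search(b(s, 10))", setup='from __main__ import r, s, b'))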
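One thing worth checking with any such wrapper is that "no size given" really means "up to the end of the data", as it does for buffer(source, offset); a default slice bound of -1 would silently drop the last byte. The following is a self-contained sketch of that check. The helper name view_from is hypothetical and not taken from the code above.

```python
import re
import sys

PYTHON_3 = sys.version_info >= (3,)

def view_from(source, start, size=None):
    """Hypothetical helper: zero-copy view with buffer(source, offset[, size]) semantics."""
    if not PYTHON_3:
        # Python 2: buffer already takes an offset and an optional size.
        if size is None:
            return buffer(source, start)
        return buffer(source, start, size)
    view = memoryview(source)
    if size is None:
        return view[start:]
    return view[start:start + size]

s = b"Can you count to three or sing 'ABC?'"
r = re.compile(b"ABC")

tail = view_from(s, 10)        # everything from offset 10 to the end
head = view_from(s, 10, 5)     # exactly 5 bytes starting at offset 10
match = r.search(tail)

print(len(tail), len(head), match.start())
```

The match offset is relative to the view, so it equals the position of the pattern in s minus the start offset.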