Is there a way to use a memoryview with regular expressions in Python 2?

Asked: 2015-04-26 07:00:06

Tags: python

In Python 3, the re module can be used with memoryview:

~$ python3
Python 3.2.3 (default, Feb 20 2013, 14:44:27)
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> x = b"abc"
>>> import re
>>> re.search(b"b", memoryview(x))
<_sre.SRE_Match object at 0x7f14b5fb8988>

In Python 2, however, this does not seem to be the case:

~$ python
Python 2.7.3 (default, Mar 13 2014, 11:03:55)
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> x = "abc"
>>> import re
>>> re.search(b"b", memoryview(x))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/re.py", line 142, in search
    return _compile(pattern, flags).search(string)
TypeError: expected string or buffer

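I could convert the string to a buffer, but the buffer documentation does not say exactly how buffer compares with memoryview.

An empirical comparison shows that using a buffer object in Python 2 does not give the performance advantage that memoryview gives in Python 3: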

playground$ cat speed-test.py
import timeit
import sys

print(timeit.timeit("regex.search(mv[10:])", setup='''
import re
import sys  # timeit runs the setup string in its own namespace, so import sys here
regex = re.compile(b"ABC")
PYTHON_3 = sys.version_info >= (3, )
if PYTHON_3:
    mv = memoryview(b"Can you count to three or sing 'ABC?'" * 1024)
else:
    mv = buffer(b"Can you count to three or sing 'ABC?'" * 1024)
'''))
playground$ python2.7 speed-test.py
2.33041596413
playground$ python2.7 speed-test.py
2.3322429657
playground$ python3.2 speed-test.py
0.381270170211792
playground$ python3.2 speed-test.py
0.3775448799133301
playground$

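If the regex.search argument is changed from mv[10:] to mv, Python 2 performs about the same as Python 3, but the code I am writing does a lot of repeated string slicing.

Is there a way around this problem in Python 2 that still keeps the zero-copy performance advantage of memoryview?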

1 answer:

Answer 0 (score: 2)

The way I understand the buffer object in Python 2, you are supposed to use it without slicing:

>>> s = b"Can you count to three or sing 'ABC?'"
>>> str(buffer(s, 10))
"unt to three or sing 'ABC?'"

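So instead of slicing the resulting buffer, you use the buffer function itself to do the slicing, which gives fast access to the substring you are interested in: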

import timeit
import sys
import re

r = re.compile(b'ABC')
s = b"Can you count to three or sing 'ABC?'" * 1024

PYTHON_3 = sys.version_info >= (3, )
if len(sys.argv) > 1: # standard slicing
    print(timeit.timeit("r.search(s[10:])", setup='from __main__ import r, s'))
elif PYTHON_3: # memoryview in Python 3
    print(timeit.timeit("r.search(s[10:])", setup='from __main__ import r, s; s = memoryview(s)'))
else: # buffer in Python 2
    print(timeit.timeit("r.search(buffer(s, 10))", setup='from __main__ import r, s'))

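I get very similar results in Python 2 and 3, which suggests that using buffer with the re module has the same effect as the newer memoryview (which then seems to be a lazily evaluated buffer):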

$ python2 .\speed-test.py
0.681979371561
$ python3 .\speed-test.py
0.5693422508853488
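
The similar timings are consistent with the view sharing the underlying bytes rather than copying them. A minimal Python 3 sketch of that behaviour follows; the mutable bytearray is my own choice for illustration (it is not in the answer) and is used only so that the sharing becomes observable:

```python
import re

data = bytearray(b"Can you count to three or sing 'ABC?'")

view = memoryview(data)[10:]  # slice of a memoryview: shares data's memory
copy = bytes(data[10:])       # ordinary slice: an independent copy

# Overwrite the "A" of "ABC" in the original buffer.
pos = data.find(b"ABC")
data[pos] = ord("X")

# The view reflects the change, the copy does not.
view_sees_change = re.search(b"XBC", view) is not None
copy_sees_change = re.search(b"XBC", copy) is not None
```

If the view had copied the data when it was sliced, it would still contain "ABC" after the write.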

Compared with standard string slicing:

$ python2 .\speed-test.py standard-slicing
7.92006735956
$ python3 .\speed-test.py standard-slicing
7.817641705304309
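
Since the cost above comes from the copy that standard slicing makes, another option, which is not part of this answer and which I have not timed, is to avoid slicing altogether: compiled patterns accept an optional pos argument in both Python 2 and Python 3. The sketch below assumes that the pattern does not rely on `^` matching at the start position, because with pos the `^` anchor still refers to the real beginning of the string. The helper name search_from is hypothetical.

```python
import re

pattern = re.compile(b"ABC")
data = b"Can you count to three or sing 'ABC?'" * 1024


def search_from(compiled, source, start):
    """Search source beginning at offset start, without slicing it first."""
    return compiled.search(source, start)


match = search_from(pattern, data, 10)

# Match offsets are relative to the whole string, not to the start position,
# so they differ from the sliced version by exactly the offset.
sliced_match = pattern.search(data[10:])
offset_difference = match.start() - sliced_match.start()
```

Because the reported offsets refer to the full string, code that previously added the slice offset back by hand would need adjusting.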

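If you want to support slice access (so that you can use the same syntax everywhere), you can easily create a type that creates a new buffer on the fly whenever you slice it: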

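class slicingbuffer(object):
    """Wrap a string so that slicing returns a buffer (Python 2 only)."""
    def __init__(self, source):
        self.source = source
    def __getitem__(self, index):
        if not isinstance(index, slice):
            # Single index: a one-byte buffer at that position.
            return buffer(self.source, index, 1)
        start = index.start or 0  # s[:n] has start None
        if index.stop is None:
            return buffer(self.source, start)
        # Negative indices and steps are not handled here.
        size = max(index.stop - start, 0)
        return buffer(self.source, start, size)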

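If you only use it with the re module, it may work as a drop-in replacement for memoryview. However, my tests show that this already adds considerable overhead. So you may want to do the opposite and wrap Python 3's memoryview object in a wrapper that provides the same interface as buffer: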

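def memoryviewbuffer(source, start, size=None):
    # Mirror buffer(object, offset, size): the third argument is a length,
    # and leaving it out means "up to the end of the object".
    if size is None:
        return source[start:]
    return source[start:start + size]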

PYTHON_3 = sys.version_info >= (3, )
if PYTHON_3:
    b = memoryviewbuffer
    s = memoryview(s)
else:
    b = buffer

print(timeit.timeit("r.search(b(s, 10))", setup='from __main__ import r, s, b'))
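
For reference, the same idea can be written as a single self-contained helper. This is only a sketch under my own assumptions: the name view_from is hypothetical, the Python 2 branch relies on the buffer builtin, and only the Python 3 branch is exercised by the checks shown here.

```python
import re
import sys

PYTHON_3 = sys.version_info >= (3,)


def view_from(source, offset, size=None):
    """Return a view of source starting at offset, with buffer-like arguments."""
    if PYTHON_3:
        # Slicing a memoryview yields another view rather than a copy.
        end = None if size is None else offset + size
        return memoryview(source)[offset:end]
    if size is None:
        return buffer(source, offset)  # Python 2 builtin
    return buffer(source, offset, size)


data = b"Can you count to three or sing 'ABC?'"
pattern = re.compile(b"ABC")

# The view can be passed straight to the compiled pattern.
match = pattern.search(view_from(data, 10))
found = match is not None
```

Keeping the (offset, size) argument order of buffer means the calling code looks the same on both interpreter versions.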