Question

In my optimization task, I found that the built-in split() method is about 40% faster than the re.split() equivalent.

A dummy benchmark (easy to copy-paste):

import re, time, random

def random_string(_len):
    # Build a random string of length _len from the letters A, B and C
    letters = "ABC"
    return "".join(random.choice(letters) for _ in range(_len))

r = random_string(2000000)
pattern = re.compile(r"A")

# Time the regex-based split
start = time.time()
pattern.split(r)
print("with re.split : ", time.time() - start)

# Time the built-in string split
start = time.time()
r.split("A")
print("with built-in split : ", time.time() - start)

Why is there such a difference?

Answer 1

re.split is expected to be slower, since using regular expressions incurs some overhead.

Of course, if you are splitting on a constant string, there is no point in using re.split().
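
As a rough illustration of that point (a sketch, not taken from the question's benchmark): for a fixed delimiter both calls return the same list, so the regex engine adds cost without adding anything. The sample string and variable names are made up for the example, and re.escape is only needed because the chosen delimiter happens to be a regex metacharacter.

```python
import re

text = "a.b.c.d"   # sample input, chosen only for illustration
delimiter = "."    # a constant delimiter that is also a regex metacharacter

# str.split treats the delimiter literally.
plain_fields = text.split(delimiter)

# re.split needs the delimiter escaped to get the same literal behaviour.
regex_fields = re.split(re.escape(delimiter), text)

print(plain_fields == regex_fields)
```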

Answer 2

When in doubt, check the source code. You can see that Python's s.split() is optimized for whitespace and is inlined. But s.split() works for fixed delimiters only.

In exchange for the speed, a regular-expression-based split with re.split is far more flexible.

>>> re.split(':+',"One:two::t h r e e:::fourth field")
['One', 'two', 't h r e e', 'fourth field']
>>> "One:two::t h r e e:::fourth field".split(':')
['One', 'two', '', 't h r e e', '', '', 'fourth field']
# would require an additional step to find the empty fields...
>>> re.split(r'[:\d]+',"One:two:2:t h r e e:3::fourth field")
['One', 'two', 't h r e e', 'fourth field']
# try that without a regex split in an understandable way...

That re.split() is only 29% slower (or that s.split() is only 40% faster) is what should be amazing.
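
For what it's worth, the "additional step" mentioned in the comment above could look something like this. It is only a sketch: the variable names are made up here, and it reuses the sample line from the first example.

```python
import re

line = "One:two::t h r e e:::fourth field"

# Plain split keeps an empty string for every repeated ':'.
raw_fields = line.split(":")

# The extra step: drop the empty strings afterwards.
fields = [field for field in raw_fields if field]

# For this input the result matches the regex version.
regex_fields = re.split(":+", line)
print(fields == regex_fields)
```

Note that the two only agree when the line has no leading or trailing delimiter: re.split(":+", ...) keeps an empty first or last field in that case, while the filter drops it.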

Answer 3

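Running a regular expression means that you are running a state machine for each character. Splitting on a constant string means that you are just searching for that string. The second is a much less complicated procedure.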
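
To make "just searching for the string" concrete, here is a toy version of splitting on a constant separator using nothing but substring search. It is a conceptual sketch only: CPython's real str.split is implemented in C, and the function name split_on_constant is made up for this example.

```python
def split_on_constant(text, sep):
    """Toy re-implementation of text.split(sep) using plain substring search."""
    if not sep:
        raise ValueError("empty separator")
    fields = []
    start = 0
    while True:
        hit = text.find(sep, start)  # plain substring search, no pattern matching
        if hit == -1:
            # No more separators: the rest of the text is the last field.
            fields.append(text[start:])
            return fields
        fields.append(text[start:hit])
        start = hit + len(sep)

sample = "BACABAAC"
print(split_on_constant(sample, "A") == sample.split("A"))
```

The loop never has to track anything beyond the current search position, which is the sense in which it is the simpler procedure.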

Python re.split() vs split()
