Question

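I wrote a Python utility that scans log files for known error patterns.

I tried to speed up the search by giving the regex engine additional pattern information. For example, instead of just looking for lines containing gold, I require that the line also start with an underscore, so: ^_.*gold instead of gold.

Since 99% of the lines do not start with an underscore, I expected a big performance payoff, because the regex engine could stop reading a line after just one character. To my surprise, it turned out the other way.

The following program illustrates the problem: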

import re
from time import time
def main():
    line = r'I do not start with an underscore 123456789012345678901234567890'
    p1 = re.compile(r"^_") # requires  underscore as a first char
    p2 = re.compile(r"abcdefghijklmnopqrstuvwxyz")
    patterns = (p1, p2)

    for p in patterns:
        start = time()
        for i in xrange(1000*1000):
            match = re.search(p, line)
        end = time() 
        print 'Elapsed: ' + str(end-start) 
main()

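I tried reading through sre_compile.py looking for an explanation, but its code was too hairy for me.

Could the observed performance be explained by the start-of-line character turning the regex engine's work from a simple substring operation into a much more complicated backtracking state machine operation, which outweighs any benefit from aborting the search after the first character?

With that in mind, I tried making the line 8 times longer, expecting the start-of-line search to shine, but the gap only grew wider (22 seconds vs. 6 seconds).

I'm puzzled: am I missing something here?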

Answer 1

How about:

# startswith() also handles an empty line, where line[0] would raise IndexError
if line.startswith("_") and "gold" in line:
   print "Yup, it starts with an underscore"
else:
   print "Nope it doesn't"

Seriously, don't overuse regexes.
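For reference, here is the same idea as a small reusable check. This is only a minimal sketch in Python 3 syntax (the code in this thread is Python 2); the helper name `is_candidate` and the sample lines are made up for illustration.

```python
def is_candidate(line, word="gold"):
    """Return True if line starts with an underscore and contains word."""
    # str.startswith is safe on empty strings, unlike indexing with line[0]
    return line.startswith("_") and word in line


# Hypothetical sample lines, not taken from any real log
samples = ["_found gold here", "gold but no underscore", "_nothing", ""]
for s in samples:
    print(repr(s), is_candidate(s))
```

Only the first sample passes both tests; the empty string is rejected without raising an exception.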

Answer 2

You're actually doing two things wrong: if you want to look at the start of the string, use match, not search. Also, don't use re.match(pattern, line); compile the pattern and use pattern.match(line).

import re
from time import time
def main():
    line = r'I do not start with an underscore 123456789012345678901234567890'
    p1 = re.compile(r"_") # requires  underscore as a first char
    p2 = re.compile(r"abcdefghijklmnopqrstuvwxyz")
    patterns = (p1, p2)

    for p in patterns:
        start = time()
        for i in xrange(1000*1000):
            match = p.match(line)
        end = time() 
        print 'Elapsed: ' + str(end-start) 
main()

You'll see that you now get the expected behaviour: both patterns take about the same time.
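To show that switching from search to match does not change which lines are accepted, here is a small sketch in Python 3 syntax. The pattern variable names and the sample lines are assumptions made up for this example; it compares results only, not speed.

```python
import re

anchored_search = re.compile(r"^_.*gold")  # intended for .search()
prefix_match = re.compile(r"_.*gold")      # intended for .match(), which only tries position 0

# Hypothetical sample lines
lines = ["_here is gold", "no underscore but gold", "x _gold later"]

results = []
for line in lines:
    via_search = anchored_search.search(line) is not None
    via_match = prefix_match.match(line) is not None
    results.append((via_search, via_match))
    print(repr(line), via_search, via_match)
```

Both approaches accept only the first line, so for single-line input without re.MULTILINE the anchored search and the unanchored match select the same lines.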

Answer 3

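Interesting observation! I played with it a bit. My guess is that the regex engine scans the entire string for an underscore and only checks it against the beginning of a line once it has found one. Maybe this has to do with keeping the behaviour uniform when re.MULTILINE is used.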
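This does not confirm the guess about the engine's internals, but it does show what re.MULTILINE changes for the ^ anchor. A minimal sketch in Python 3 syntax, with a made-up two-line string:

```python
import re

text = "first line\n_second line"

# Without the flag, ^ matches only at the very start of the string
plain = re.search(r"^_", text)

# With re.MULTILINE, ^ also matches right after each newline
multi = re.search(r"^_", text, re.MULTILINE)

print(plain)          # None
print(multi.start())  # 11
```

So with the flag set, a search for ^_ really does have to look past the first character, because a later line may start with an underscore.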

If you use re.match instead of re.search for the underscore pattern, both seem to be equally fast, i.e.

import re
from time import time

def main():
    line = r'I do not start with an underscore 123456789012345678901234567890'
    p1 = re.compile(r"_.*") # requires underscore as a first char (when used with match)
    p2 = re.compile(r"abcdefghijklmnopqrstuvwxyz")
    patterns = (p1, p2)

    start = time()
    for i in xrange(1000*1000):
        match = re.match(p1, line)
    end = time() 
    print 'Elapsed: ' + str(end-start) 
    start = time()
    for i in xrange(1000*1000):
        match = re.search(p2, line)
    end = time() 
    print 'Elapsed: ' + str(end-start)
main()

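In this case, match requires the pattern to match at the beginning of the string.

Also, note that calling the methods of the precompiled patterns, as follows, seems to be faster: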

for p in patterns:
    start = time()
    for i in xrange(1000*1000):
        match = p.search(line)
    end = time() 
    print 'Elapsed: ' + str(end-start)

But the speed difference still remains...

Answer 4

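Regexes don't always behave as you'd expect. I don't know the internals, so I can't explain this behaviour precisely. One thing to note is that if you change from search to match, the ranking flips and the underscore pattern becomes the faster one (although that isn't exactly what you want).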

You are doing the right thing: measure, and use the technique that is empirically faster.
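One way to do that measuring is the standard timeit module. The sketch below is in Python 3 syntax and reuses the line and patterns from the question; the dictionary names and the choice of 100,000 calls per run are assumptions for this example, and the absolute numbers will depend on your machine and Python version.

```python
import re
import timeit

line = "I do not start with an underscore 123456789012345678901234567890"
anchored = re.compile(r"^_")
literal = re.compile(r"abcdefghijklmnopqrstuvwxyz")

# Each candidate is a zero-argument callable that timeit can invoke directly
candidates = {
    "anchored.search": lambda: anchored.search(line),
    "anchored.match": lambda: anchored.match(line),
    "literal.search": lambda: literal.search(line),
    "str.startswith": lambda: line.startswith("_"),
}

timings = {}
for name, func in candidates.items():
    # Best of 3 runs of 100,000 calls each, to reduce noise from other processes
    timings[name] = min(timeit.repeat(func, number=100_000, repeat=3))

# Print the fastest candidate first
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name:18s} {seconds:.4f}s")
```

Taking the minimum of several repeats is the usual way to read timeit results, since slower runs mostly reflect interference from the rest of the system.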

It is counter-intuitive that the search gets slower when it is given a start character.
