Question

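I am processing a 500 MB file. The processing time goes up when I use re.search.

Please look at the cases I tested below. In every case I read the file line by line and use only a single if condition.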

Case 1:

prnt = re.compile(r"(?i)<spanlevel level='7'>")
if prnt.search(line):
    print "Matched"
    out_file.write(line)
else:
    out_file.write(line)

This takes 16 seconds to read the whole file.

Case 2:

if re.search(r"(?i)<spanlevel level='7'>", line):
    print "Matched"
    out_file.write(line)
else:
    out_file.write(line)

This takes 25 seconds to read the file.

Case 3:

if "<spanlevel level='7'>" in line:
    print "Matched"
    out_file.write(line)
else:
    out_file.write(line)

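This takes only 8 seconds to read the file.

Can anyone explain the difference between these three cases? Case 3 is very fast, but I cannot do a case-insensitive match with it. How can I do case-insensitive matching in Case 3?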

Answer 1

First, a case-insensitive search for Case 3:

if "<spanlevel level='7'>" in line.lower():

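By lowercasing line, you turn this into a lowercase search. Note that the text you search for must itself be all lowercase for this to work.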
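As a minimal sketch of that approach, the loop from the question could look like the following. The function name copy_and_count, the NEEDLE constant, and the in-memory sample data are illustrative assumptions, not part of the original code. Since the question writes the line in both branches, the write is done once per line here.

```python
import io

# The search text must already be lowercase, because each line is lowercased
# before the comparison.
NEEDLE = "<spanlevel level='7'>"


def copy_and_count(in_file, out_file, needle=NEEDLE):
    """Copy every line to out_file and count lines containing needle, ignoring case."""
    matched = 0
    for line in in_file:
        if needle in line.lower():
            matched += 1
        # The original code writes the line whether or not it matched.
        out_file.write(line)
    return matched


# Hypothetical sample input standing in for the 500 MB file.
src = io.StringIO(
    "<SpanLevel LEVEL='7'>text\n"
    "no match here\n"
    "<spanlevel level='7'>\n"
)
dst = io.StringIO()
count = copy_and_count(src, dst)
```

With real files, in_file and out_file would be the objects returned by open(); the lowercased copy is only used for the test, so the output keeps the original casing.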

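As for why Case 2 is so much slower: using a pre-compiled regular expression is faster because you avoid a cache lookup for the pattern on every line read from the file. Under the hood, re.search() also calls re.compile() if no cached copy exists yet, and that extra function call and cache check cost you time.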
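A rough way to see the difference yourself is to time the three variants with timeit on synthetic data. This is only a sketch: the sample lines and the function names are made up for illustration, and the absolute numbers will depend on the Python version and the machine.

```python
import re
import timeit

PATTERN = r"(?i)<spanlevel level='7'>"
NEEDLE = "<spanlevel level='7'>"
compiled = re.compile(PATTERN)

# Synthetic stand-in for the file: half of the lines match.
lines = ["<SPANLEVEL LEVEL='7'>hit\n", "plain text without the tag\n"] * 5000


def with_compiled():
    # Case 1: pattern compiled once, reused for every line.
    return sum(1 for line in lines if compiled.search(line))


def with_module_search():
    # Case 2: re.search() has to look the pattern up in its cache on every call.
    return sum(1 for line in lines if re.search(PATTERN, line))


def with_in_lower():
    # Case 3 with lowercasing: plain substring test, no regex engine involved.
    return sum(1 for line in lines if NEEDLE in line.lower())


timings = {}
for fn in (with_compiled, with_module_search, with_in_lower):
    timings[fn.__name__] = timeit.timeit(fn, number=5)
    print("%-20s %.4f s" % (fn.__name__, timings[fn.__name__]))
```

All three functions count the same lines, so any difference in the printed times comes from how the match is performed rather than from what is matched.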

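This is doubly painful on Python 3.3, which switched to a new caching model using the functools.lru_cache decorator that is actually slower than the previous implementation. See Why are uncompiled, repeatedly used regexes so much slower in Python 3?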

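A simple text search with in is faster for exact text matches. Regular expressions are great for complex matching, but here you are only looking for an exact match, albeit a case-insensitive one.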