Question

I have a long log file of ~3000 lines, and I need to find the first occurrence of a certain string. Which approach is best / most efficient?

with open(filename, 'r') as f:
    match = re.search(r'^EXHAUST.*', f.read(), re.MULTILINE)

or

with open(filename, 'r') as f:
    for line in f:
        match = re.match(r'EXHAUST.*', line)

Or is there a better way that I haven't thought of?

Answer 1

In this case you can use str.startswith, which is the more Pythonic way:

with open(filename, 'r') as f:
    for line in f:
        if line.startswith('EXHAUST'):
            # do stuff with the first matching line, then stop
            break

As for re.search vs re.match: if you want to match from the beginning of the string, re.match is the more efficient choice, because it only tries the pattern at the start of the string, whereas re.search scans through the whole string looking for a match.
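
To make the re.match vs re.search distinction concrete, here is a minimal sketch. The sample line is made up for illustration and is not taken from the asker's log.

```python
import re

# Hypothetical log line where EXHAUST is not at the start.
line = "INFO EXHAUST fan started"

# re.match only tries the pattern at position 0 of the string.
at_start = re.match(r"EXHAUST", line)

# re.search scans forward and can match anywhere in the string.
anywhere = re.search(r"EXHAUST", line)

print(at_start)          # None, the line does not begin with EXHAUST
print(anywhere.start())  # 5
```

When looping over lines, re.match therefore gives the "line starts with" behavior without needing the ^ anchor.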

Answer 2

I like your second one, but performance-wise, since your regex is so simple, you can use the startswith method instead:

with open(filename, 'r') as f:
    for line in f:
        match = line.startswith('EXHAUST')
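
Since the question asks for the first occurrence only, the loop can stop at the first hit. A minimal sketch of that idea follows; the helper name first_line_starting_with and the demo file contents are assumptions made for illustration.

```python
import os
import tempfile

def first_line_starting_with(path, prefix):
    """Return the first line in `path` that begins with `prefix`, or None."""
    with open(path, "r") as f:
        # The generator is lazy, so reading stops at the first hit.
        return next(
            (line.rstrip("\n") for line in f if line.startswith(prefix)),
            None,
        )

# Small self-contained demo using a temporary file with made-up log lines.
with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as tmp:
    tmp.write("INFO boot\nEXHAUST fan on\nEXHAUST fan off\n")
    demo_path = tmp.name

first_hit = first_line_starting_with(demo_path, "EXHAUST")
missing = first_line_starting_with(demo_path, "ERROR")
os.remove(demo_path)

print(first_hit)
print(missing)
```

Returning None when nothing matches keeps the caller's check as simple as the match object check in the regex versions.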

Answer 3

You can simply check the approximate time an algorithm takes with something like Python's datetime library, for example:

import datetime

start = datetime.datetime.now()
# insert your code here #
end = datetime.datetime.now()

result = end - start
print(result)
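
As an alternative sketch of the same timing idea, time.perf_counter from the standard library is intended for measuring short durations. The in-memory text below is made up so the example runs without a real log file; actual timings will differ from machine to machine.

```python
import re
import time

# Made-up in-memory "log" so the sketch runs without a real file.
text = "INFO nothing to see\n" * 3000 + "EXHAUST fan on\n"

start = time.perf_counter()
match = re.search(r"^EXHAUST.*", text, re.MULTILINE)
elapsed = time.perf_counter() - start

print(match.group(0))
print(f"{elapsed:.6f} s")
```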

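The thing is, with 3000 lines the time Python takes to find the phrase with either of these two methods is low. However, from my testing, the first method is a little faster if the phrase is located near the end of the text. I tested a 454 kB text file of over 3000 lines, most of the lines being entire paragraphs. The time was around 0.09 s (for the one below). However, I must mention that without the ^ regex symbol for matching the beginning of the string, the task took only 0.04 s.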

with open(filename, 'r') as f:
    match = re.search(phrase, f.read())

compared to 0.12 s for

with open(filename, 'r') as f:
    i = 0
    for line in f:
        i += 1
        match = re.match(phrase, line)
        if match:
            break

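The break is needed here; otherwise match is overwritten on every later line and ends up holding the result for the last line rather than the first occurrence. The counter i is used to find which line the match was found on, because the positions returned by the .start() and .end() methods are otherwise relative to the line we are on. With the search approach, on the other hand, the .start() and .end() match object methods give you the position within the whole text by default.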
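
The difference in what .start() reports can be sketched as follows. The sample text is made up, and enumerate is swapped in for the manual counter i purely for brevity.

```python
import re

# Made-up log text standing in for the file contents.
text = "INFO boot\nINFO ready\nEXHAUST fan on\n"

# Whole-text search: start() is an offset into the entire text.
whole = re.search(r"^EXHAUST.*", text, re.MULTILINE)

# Line by line: start() is relative to the current line,
# so the line number has to be tracked separately.
line_no = None
per_line = None
for line_no, line in enumerate(text.splitlines(), start=1):
    per_line = re.match(r"EXHAUST.*", line)
    if per_line:
        break  # keep the first occurrence

print(whole.start())              # 21
print(line_no, per_line.start())  # 3 0
```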

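However, in my test case the first occurrence was near the end of the .txt file, so if it were near the beginning, the second method would come out ahead, since it stops searching at that line, while the time taken by the first method stays the same.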

Unless you are doing this for competitive coding (for which Python might not be the best choice anyway), both methods take very little time.

Python - regex on a large log file
