Question

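I am parsing a log file in which every line starts with a date and time, followed by a system event message. I want to use a regular expression to match the date and time I need, without using strptime or any other time module to do the calculation. I am trying to match the date Sep 12 and a specific time window (09:23:45-09:23:50), i.e. about 5 seconds of records. The log file has the following format: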

Sep 12 09:23:45 localhost systemd: Switching root.
Sep 12 09:23:45 localhost journal: Journal stopped
Sep 12 09:23:46 localhost journal: Runtime journal is using 8.0M (max allowed 91.1M, trying to leave 136.7M free of 903.7M available ? current limit 91.1M).
Sep 12 09:23:46 localhost journal: Runtime journal is using 8.0M (max allowed 91.1M, trying to leave 136.7M free of 903.7M available ? current limit 91.1M).
Sep 12 09:23:46 localhost systemd-journald[88]: Received SIGTERM from PID 1 (systemd).

The Python code I tried:

import fileinput,re
for i in fileinput.input():
    if (re.search(r'Sep 12 09:23:[45-50]',i)):
        print(i)

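Also, if I try to parse a large file of more than 100 GB, can anyone tell me what the impact of this same code would be? Can I rewrite this code to reduce the memory overhead?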

Answer 1

I would go with a slightly different regular expression:

^Sep 12 09:23:(?:4[5-9]|50)

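Explanation: [45-50] is a character class that matches 4, everything between 5 and 5, and 0. In other words, it matches a single character that is 4, 5, or 0, not the numbers 45 through 50. This is because character classes work char-by-char. The classic fix is to define alternatives by digit prefix: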

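(?:...) is a non-capturing group, used to save some resources
4[5-9] matches the numbers 45, 46, ... 49
The other alternative is 50, the upper bound of your interval.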
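To make the difference concrete, here is a small sketch that runs both patterns against a handful of timestamps. The timestamps outside the 45-50 window are made up for illustration and do not come from the log in the question.

```python
import re

# The pattern from the question: [45-50] is a one-character class (4, 5 or 0).
broken = re.compile(r'^Sep 12 09:23:[45-50]')
# The pattern from this answer: 45..49 via 4[5-9], or the literal 50.
fixed = re.compile(r'^Sep 12 09:23:(?:4[5-9]|50)')

# Illustrative timestamps only; most are outside the wanted window.
samples = [
    'Sep 12 09:23:05',
    'Sep 12 09:23:12',
    'Sep 12 09:23:44',
    'Sep 12 09:23:45',
    'Sep 12 09:23:49',
    'Sep 12 09:23:50',
    'Sep 12 09:23:51',
]

broken_hits = [s for s in samples if broken.match(s)]
fixed_hits = [s for s in samples if fixed.match(s)]

print('character class matched:', broken_hits)
print('alternation matched:   ', fixed_hits)
```

The character class accepts any timestamp whose seconds field starts with 4, 5 or 0, so it lets through 05, 44 and 51 as well, while the alternation keeps only 45, 49 and 50 from this list.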

Demo here.

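You can also make sure the regular expression is compiled only once, up front, so the loop does not repeat that work for every line and your script uses less memory and CPU: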

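import fileinput
import re

# Compile once, outside the loop; this is the speedup.
regex = re.compile(r'^Sep 12 09:23:(?:4[5-9]|50)')
for line in fileinput.input():
    # match() only tries the start of the line, so search() is not needed.
    if regex.match(line):
        # The line already ends with a newline, so suppress print's own.
        print(line, end='')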
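On the 100 GB part of the question: iterating over a file (or over fileinput) reads one line at a time, so memory use should stay roughly constant regardless of file size, as long as matches are written out immediately rather than collected in a list. A minimal sketch of that idea follows. The name `matching_lines` is hypothetical, and the plain `startswith` pre-check is an assumption about what might be cheaper than running the regex on every line, not something that has been measured here.

```python
import re

PREFIX = 'Sep 12 09:23:'
PATTERN = re.compile(r'^Sep 12 09:23:(?:4[5-9]|50)')


def matching_lines(lines):
    """Yield matching lines one at a time; nothing is accumulated.

    `lines` can be any iterable of strings, for example an open file.
    This helper is a hypothetical name used for illustration.
    """
    for line in lines:
        # Cheap string test first; the regex runs only on candidate lines.
        if line.startswith(PREFIX) and PATTERN.match(line):
            yield line
```

It could be used along these lines, where `path` stands for whatever log file you are reading: `with open(path, errors='replace') as handle:` followed by `for line in matching_lines(handle): print(line, end='')`. Because the function is a generator, only the current line is held in memory at any moment.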
