Finding the right regex to aggregate logs?

Time: 2019-02-09 23:41:45

Tags: python regex logging

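I want to find all logs with "similar" error messages and count how many times each kind occurs. The problem is that the error messages often contain dynamic parts.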

For example, given error messages such as

"Didn't accept value 3 for parameter foo"
"Didn't accept value 6 for parameter bar"
"Could not open file 'my_file.json' because: it does not exist"
"Could not open file 'my_other_file.json' because: it is not formatted correctly"

I'd like to count the occurrences of these logs so that I end up with output like this:

"Didn't accept value * for parameter *" -- 2 counts
"Could not open file * because: *" -- 2 counts

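The problem with writing regexes by hand is that the log messages come from many teams and in many different formats. I would have to write dozens of regexes to get the counts, and I would still be left with a long tail of uncounted log messages.

Is there some way to detect which parts of a log message are dynamic and aggregate on the rest?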

1 Answer:

Answer 0: (score: 0)

Do you mean something like this?

import re

logs = [
    "Didn't accept value 3 for parameter foo",
    "Didn't accept value 6 for parameter bar",
    "Could not open file 'my_file.json' because: it does not exist",
    "Could not open file 'my_other_file.json' because: it is not formatted correctly",
]

counts = {
    "Didn't accept value * for parameter *": 0,
    "Could not open file * because: *": 0
}

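for log in logs:
    # Try each known message shape in turn; the first match wins.
    s = re.search(r"Didn't accept value \d+ for parameter \w+", log)
    if s:
        counts["Didn't accept value * for parameter *"] += 1
        continue
    # [^']+ matches the quoted file name; \w+ only checks that a reason follows.
    s = re.search(r"Could not open file '[^']+' because: \w+", log)
    if s:
        counts["Could not open file * because: *"] += 1
        continue
    # Messages that match neither pattern are silently skipped.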

print(counts)
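
If writing one regex per message shape doesn't scale, another option is to let the messages themselves reveal which tokens vary. The following is only a rough sketch of that idea, not a tested solution: it splits each message on whitespace, compares it with the templates seen so far using `difflib.SequenceMatcher`, and replaces every differing run of tokens with `*`. The names `merge_tokens` and `summarize` and the `threshold` value of 0.6 are made up for this illustration; only `difflib` and `collections` come from the standard library.

```python
from collections import Counter
from difflib import SequenceMatcher

WILDCARD = "*"


def merge_tokens(template, tokens):
    """Keep tokens shared by both sequences; collapse each differing run into one wildcard."""
    matcher = SequenceMatcher(None, template, tokens, autojunk=False)
    merged = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            merged.extend(template[i1:i2])
        elif not merged or merged[-1] != WILDCARD:
            merged.append(WILDCARD)
    return merged


def summarize(messages, threshold=0.6):
    """Greedily group messages whose token sequences are at least `threshold` similar."""
    templates = []  # each entry is [token_list, count]
    for message in messages:
        tokens = message.split()
        best, best_ratio = None, 0.0
        for entry in templates:
            ratio = SequenceMatcher(None, entry[0], tokens, autojunk=False).ratio()
            if ratio > best_ratio:
                best, best_ratio = entry, ratio
        if best is not None and best_ratio >= threshold:
            # Similar enough: fold the message into the existing template.
            best[0] = merge_tokens(best[0], tokens)
            best[1] += 1
        else:
            # Nothing close enough yet: start a new template from this message.
            templates.append([tokens, 1])
    summary = Counter()
    for tokens, count in templates:
        summary[" ".join(tokens)] += count
    return summary


logs = [
    "Didn't accept value 3 for parameter foo",
    "Didn't accept value 6 for parameter bar",
    "Could not open file 'my_file.json' because: it does not exist",
    "Could not open file 'my_other_file.json' because: it is not formatted correctly",
]

summary = summarize(logs)
for template, count in summary.items():
    print(f'"{template}" -- {count} counts')
```

On the four sample messages this should give two groups of two. The first comes out as `Didn't accept value * for parameter *`; the second keeps any words the two reasons happen to share after `because:`, so it is not as clean as a hand-written pattern. The grouping depends on message order and on the threshold, and every message is compared against every template, so for large or very mixed logs you would probably want to tune or replace this heuristic.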