Splitting multi-line log entries with a Python regular expression

Time: 2018-03-14 07:49:09

Tags: python regex logging multiline

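I need to create a regular expression in Python that can take the sample below and split it into individual log entries. I am using the date to identify the start of each entry, but my patterns only match a single line, from the date to the end of the first line, and completely miss the stack trace content. I want the full log entries because there is a lot of repeated logging, and I would like to filter out the duplicates and reduce the log to a handful of unique entries. Once I have identified the entries, I also want to strip any information that makes a string unique, such as the date-time stamp, so that a comparison function can flag it as a duplicate. I have tried a positive lookahead and the multiline flag, to no avail. Does anyone know how to do what I'm after?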

Some of the regular expressions I tried:

^\d{4}-\d{2}-\d{2}.*\(.*\)$ // it matches single line date to parenthesis
^(\d{4}-\d{2}-\d{2}|\s|).*\)$ // matches single line with tabs - not much better
^\d{4}-\d{2}-\d{2}.*(?=\d{4}-\d{2}-\d{2}) // positive lookahead but barely works

Sample string:

2018-03-06 11:36:40:048 INFO:Starting.  (com.X.s.f.o.o)
2018-03-06 11:36:42:931 SEVERE: Error attempting to s: StatusRuntimeException   (com.Y.W.Z_H.ZHGC.sHToVe)
io.G.StatusRuntimeException: EXCEEDED
    at io.G.stub.CCalls.toStatusRuntimeException(CCalls.java:227)
    at io.G.stub.CCalls.getUnchecked(CCalls.java:208)
    at io.G.stub.CCalls.blockingUnaryCall(CCalls.java:141)
2018-03-06 11:36:46:159 SEVERE: Error attempt: StatusRuntimeException   (com.Y.W.Z_H.ZHGC.sHToVe)
io.G.StatusRuntimeException: EXCEEDED
    at io.G.stub.CCalls.toStatusRuntimeException(CCalls.java:227)
    at io.G.stub.CCalls.getUnchecked(CCalls.java:208)
    at io.G.stub.CCalls.blockingUnaryCall(CCalls.java:141)
2018-03-06 11:36:46:824 SEVERE: getConfigInteger(): eGSWindowsPortNumber    (com.Y.W.Y_Z_config_s.YZConfigs.getInteger)
2018-03-06 11:36:46:844 SEVERE: Failed to get (com.Y.W.Z_H.ZHGC.create)

Desired output:

Match 1:

INFO:Starting.  (com.X.s.f.o.o)

Match 2:

SEVERE: Error attempting to s: StatusRuntimeException   (com.Y.W.Z_H.ZHGC.sHToVe)
io.G.StatusRuntimeException: EXCEEDED
    at io.G.stub.CCalls.toStatusRuntimeException(CCalls.java:227)
    at io.G.stub.CCalls.getUnchecked(CCalls.java:208)
    at io.G.stub.CCalls.blockingUnaryCall(CCalls.java:141)

Match 3:

SEVERE: Error attempt: StatusRuntimeException   (com.Y.W.Z_H.ZHGC.sHToVe)
io.G.StatusRuntimeException: EXCEEDED
    at io.G.stub.CCalls.toStatusRuntimeException(CCalls.java:227)
    at io.G.stub.CCalls.getUnchecked(CCalls.java:208)
    at io.G.stub.CCalls.blockingUnaryCall(CCalls.java:141)

Match 4:

SEVERE: getConfigInteger(): eGSWindowsPortNumber    (com.Y.W.Y_Z_config_s.YZConfigs.getInteger)

Match 5:

SEVERE: Failed to get (com.Y.W.Z_H.ZHGC.create)

2 Answers:

Answer 0 (score: 0)

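There is no need to match the whole string with a regular expression; you can match just the date and use it to split the string into the log entries you want: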

import re

sample="""2018-03-06 11:36:40:048 INFO:Starting.  (com.X.s.f.o.o)
2018-03-06 11:36:42:931 SEVERE: Error attempting to s: StatusRuntimeException   (com.Y.W.Z_H.ZHGC.sHToVe)
io.G.StatusRuntimeException: EXCEEDED
    at io.G.stub.CCalls.toStatusRuntimeException(CCalls.java:227)
    at io.G.stub.CCalls.getUnchecked(CCalls.java:208)
    at io.G.stub.CCalls.blockingUnaryCall(CCalls.java:141)
2018-03-06 11:36:46:159 SEVERE: Error attempt: StatusRuntimeException   (com.Y.W.Z_H.ZHGC.sHToVe)
io.G.StatusRuntimeException: EXCEEDED
    at io.G.stub.CCalls.toStatusRuntimeException(CCalls.java:227)
    at io.G.stub.CCalls.getUnchecked(CCalls.java:208)
    at io.G.stub.CCalls.blockingUnaryCall(CCalls.java:141)
2018-03-06 11:36:46:824 SEVERE: getConfigInteger(): eGSWindowsPortNumber    (com.Y.W.Y_Z_config_s.YZConfigs.getInteger)
2018-03-06 11:36:46:844 SEVERE: Failed to get (com.Y.W.Z_H.ZHGC.create)"""

def date_match(s):
    """Return True if the beginning of this string matches a date and time."""
    # Raw string so that \d and \s reach the regex engine unchanged.
    return bool(re.match(r"\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}", s))

def yield_matches(full_log):
    log = []
    for line in full_log.split("\n"):
        if date_match(line):  # if this line starts with a date
            if len(log) > 0:  # if there's already a log in progress...
                yield "\n".join(log)  # ... yield the log ...
                log = []  # ... and set the log back to nothing.

        log.append(line)  # add the current line to log (a list)

    yield "\n".join(log)  # yield the last log (no date follows it to close it)

logs = list(yield_matches(sample))

for i, l in enumerate(logs):
    print("Match {}:\n{}\n".format(i + 1, l))

yield_matches appends each line to a list named log until it finds another date. When a date is found, it yields the current log and resets log to an empty list.

Here is the resulting output:

Match 1:
2018-03-06 11:36:40:048 INFO:Starting.  (com.X.s.f.o.o)

Match 2:
2018-03-06 11:36:42:931 SEVERE: Error attempting to s: StatusRuntimeException   (com.Y.W.Z_H.ZHGC.sHToVe)
io.G.StatusRuntimeException: EXCEEDED
    at io.G.stub.CCalls.toStatusRuntimeException(CCalls.java:227)
    at io.G.stub.CCalls.getUnchecked(CCalls.java:208)
    at io.G.stub.CCalls.blockingUnaryCall(CCalls.java:141)

Match 3:
2018-03-06 11:36:46:159 SEVERE: Error attempt: StatusRuntimeException   (com.Y.W.Z_H.ZHGC.sHToVe)
io.G.StatusRuntimeException: EXCEEDED
    at io.G.stub.CCalls.toStatusRuntimeException(CCalls.java:227)
    at io.G.stub.CCalls.getUnchecked(CCalls.java:208)
    at io.G.stub.CCalls.blockingUnaryCall(CCalls.java:141)

Match 4:
2018-03-06 11:36:46:824 SEVERE: getConfigInteger(): eGSWindowsPortNumber    (com.Y.W.Y_Z_config_s.YZConfigs.getInteger)

Match 5:
2018-03-06 11:36:46:844 SEVERE: Failed to get (com.Y.W.Z_H.ZHGC.create)
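The question also asks for the timestamps to be removed so that repeated entries can be filtered out. The following is a minimal sketch of that step built on the same line-by-line grouping idea, not part of the answer above. The names `TIMESTAMP`, `split_entries`, `unique_entries` and the shortened `demo` log are hypothetical, and the timestamp format is assumed to be the one shown in the sample.

```python
import re

# Timestamp format assumed from the sample log: "2018-03-06 11:36:40:048"
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}:\d{3}\s*")

def split_entries(full_log):
    """Group lines into entries; a line starting with a timestamp opens a new entry."""
    entry = []
    for line in full_log.splitlines():
        if TIMESTAMP.match(line) and entry:
            yield "\n".join(entry)
            entry = []
        entry.append(line)
    if entry:
        yield "\n".join(entry)

def unique_entries(full_log):
    """Return entries without their leading timestamp, keeping first occurrences only."""
    seen = {}
    for entry in split_entries(full_log):
        # Remove only the timestamp at the very start of the entry.
        key = TIMESTAMP.sub("", entry, count=1)
        seen.setdefault(key, None)  # dicts preserve insertion order (Python 3.7+)
    return list(seen)

# Hypothetical shortened log in which the second entry is repeated
demo = (
    "2018-03-06 11:36:40:048 INFO:Starting.\n"
    "2018-03-06 11:36:42:931 SEVERE: Error\n"
    "    at io.G.stub.CCalls.getUnchecked(CCalls.java:208)\n"
    "2018-03-06 11:36:46:159 SEVERE: Error\n"
    "    at io.G.stub.CCalls.getUnchecked(CCalls.java:208)\n"
)

for entry in unique_entries(demo):
    print(entry)
    print("---")
```

Using the stripped text as a dictionary key keeps the first occurrence of each entry and preserves the original order, which a plain set would not.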

Answer 1 (score: 0)

I was able to figure it out after reading the following:

python: multiline regular expression

https://www.safaribooksonline.com/library/view/python-cookbook-3rd/9781449357337/ch02s08.html

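If a log entry starts with a date (^\d{4}-\d{2}-\d{2}), the following regular expression matches it and keeps consuming characters lazily (.+?) until the lookahead (?=...) finds another date entry, returning everything up to that point as a single match. This matches multi-line strings! :D Note that . only crosses line breaks when the pattern is used with re.DOTALL, and ^ only anchors at the start of each line with re.MULTILINE.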

^\d{4}-\d{2}-\d{2}.+?(?=\d{4}-\d{2}-\d{2})
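As a rough illustration of how this pattern behaves, the sketch below runs it with `re.MULTILINE | re.DOTALL`. The flags are an assumption, since the answer does not state which ones were used, and the shortened `demo` log is hypothetical. Because the lookahead needs a following date, the last entry is not matched by the pattern as written; the second pattern adds a line-anchored lookahead and a `\Z` alternative, which is an addition to the original and not the answerer's own regex.

```python
import re

# Hypothetical shortened log with three entries
demo = (
    "2018-03-06 11:36:40:048 INFO:Starting.\n"
    "2018-03-06 11:36:42:931 SEVERE: Error\n"
    "    at io.G.stub.CCalls.getUnchecked(CCalls.java:208)\n"
    "2018-03-06 11:36:46:844 SEVERE: Failed to get\n"
)

# MULTILINE lets ^ match at each line start; DOTALL lets . match newlines.
flags = re.MULTILINE | re.DOTALL

# Pattern as posted: every match must be followed by another date,
# so the final entry is left out.
original = re.findall(r"^\d{4}-\d{2}-\d{2}.+?(?=\d{4}-\d{2}-\d{2})", demo, flags)

# Variant (an assumption, not from the answer): stop at the next line that
# starts with a date, or at the end of the string.
complete = [
    m.rstrip("\n")
    for m in re.findall(r"^\d{4}-\d{2}-\d{2}.+?(?=^\d{4}-\d{2}-\d{2}|\Z)", demo, flags)
]

print(len(original), "entries with the original pattern")
print(len(complete), "entries with the end-of-string alternative")
for entry in complete:
    print(entry)
    print("---")
```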

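The following regular expression matches exactly the same entries as @Sean Breckenridge's solution, but this time it drops the unique part of the string that I was trying to get rid of, the timestamp. Very useful!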

(?<=\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}:\d{3}).+?(?=\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}:\d{3}|\Z)
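A short sketch of how this last pattern might be applied follows. It assumes `re.DOTALL` so that `.` spans the stack trace lines; the answer does not state the flag. The names `STAMP` and `BODY` and the `demo` log are hypothetical. The lookbehind is accepted by Python's `re` module because the timestamp sub-pattern has a fixed width.

```python
import re

# Timestamp sub-pattern reused in both the lookbehind and the lookahead
STAMP = r"\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}:\d{3}"

# DOTALL is assumed here so that . also matches the newlines inside an entry.
BODY = re.compile(r"(?<=" + STAMP + r").+?(?=" + STAMP + r"|\Z)", re.DOTALL)

# Hypothetical shortened log with three entries
demo = (
    "2018-03-06 11:36:40:048 INFO:Starting.\n"
    "2018-03-06 11:36:42:931 SEVERE: Error\n"
    "    at io.G.stub.CCalls.getUnchecked(CCalls.java:208)\n"
    "2018-03-06 11:36:46:844 SEVERE: Failed to get\n"
)

# Each raw match keeps the space after the timestamp and the trailing newline,
# so strip() is applied before comparing entries.
bodies = [m.strip() for m in BODY.findall(demo)]

for i, body in enumerate(bodies, start=1):
    print("Match {}:\n{}\n".format(i, body))
```

Since the timestamps are no longer part of each match, the resulting strings can be compared directly to spot repeated entries.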