Question

I retrieved error log data from a server. It has the following format:

Text file:

2018-01-09 04:50:25,226 [18] INFO messages starts here line1 \n   
    line2 above error continued in next line  
2018-01-09 04:50:29,226 [18] ERROR messages starts here line1 \n  
    line2 above error continued in next line  
2018-01-09 05:50:29,226 [18] ERROR messages starts here line1 \n 
    line2 above error continued in next line

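I need to retrieve the error/informational messages together with their date-time stamps.

I wrote the code below in Python. It works fine when the error message is on a single line, but not when the same error is logged across multiple lines (in that case it returns only the first line, and I also need the following lines if they belong to the same error).

Any solution or idea would be helpful.

Here is the code: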

import re

with open('text.txt', 'r', encoding="Latin-1") as f:
    strr = re.findall(r'(\d{4}-\d{1,2}-\d{1,2}\s\d{1,2}:\d{1,2}:\d{1,2})(\,\d{1,3}\s\[\d{1,3}\]\s)(INFO|ERROR)(.*)$', f.read(), re.MULTILINE)
print(strr)

The code above outputs:

[('2018-01-09 04:50:25', ',226 [18]', 'INFO', 'messages starts here line1'), ('2018-01-09 04:50:29', ',226 [18]', 'ERROR', 'messages starts here line1'), ('2018-01-09 05:50:29', ',226 [18]', 'ERROR', 'messages starts here line1')]

I want the output to be:

[('2018-01-09 04:50:25', ',226 [18]', 'INFO', 'messages starts here line1 line2 above error continued in next line'), ('2018-01-09 04:50:29', ',226 [18]', 'ERROR', 'messages starts here line1 line2 above error continued in next line'), ('2018-01-09 05:50:29', ',226 [18]', 'ERROR', 'messages starts here line1 line2 above error continued in next line')]

Answer 1

Regex: (\d{4}(?:-\d{2}){2}\s\d{2}(?::\d{2}){2})(,\d+[^\]]+\])\s(INFO|ERROR)\s([\S\s]+?)(?=\r?\n\d{4}(?:-\d{2}){2}|$)

Python code:

import re

# `text` must hold the full contents of the log file as a single string.
matches = re.findall(r'(\d{4}(?:-\d{2}){2}\s\d{2}(?::\d{2}){2})(,\d+[^\]]+\])\s(INFO|ERROR)\s([\S\s]+?)(?=\r?\n\d{4}(?:-\d{2}){2}|$)', text)

Output:

[('2018-01-09 04:50:25', ',226 [18]', 'INFO', 'messages starts here line1\nline2 above error continued in next line'), ('2018-01-09 04:50:29', ',226 [18]', 'ERROR', 'messages starts here line1\nline2 above error continued in next line'), ('2018-01-09 05:50:29', ',226 [18]', 'ERROR', 'messages starts here line1\nline2 above error continued in next line')]
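
One way to get from this result to the single-line messages shown in the question's desired output is to collapse the whitespace inside each captured message. The sketch below reuses the pattern from this answer unchanged; the helper name parse_log and the inline sample are illustrative assumptions, not part of the original answer.

```python
import re

# Pattern copied from the answer above.
PATTERN = re.compile(
    r'(\d{4}(?:-\d{2}){2}\s\d{2}(?::\d{2}){2})(,\d+[^\]]+\])\s(INFO|ERROR)\s'
    r'([\S\s]+?)(?=\r?\n\d{4}(?:-\d{2}){2}|$)'
)

def parse_log(text):
    """Return (timestamp, millis_and_thread, level, message) tuples."""
    records = []
    for stamp, rest, level, message in PATTERN.findall(text):
        # Collapse newlines and indentation inside the message to single spaces.
        records.append((stamp, rest, level, ' '.join(message.split())))
    return records

# Hypothetical sample in the format described in the question.
sample = (
    "2018-01-09 04:50:25,226 [18] INFO messages starts here line1\n"
    "    line2 above error continued in next line\n"
    "2018-01-09 04:50:29,226 [18] ERROR messages starts here line1\n"
    "    line2 above error continued in next line\n"
)

for record in parse_log(sample):
    print(record)
```

Because the message group is lazy and stops at the lookahead, each record ends right before the next date, however many continuation lines it has.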

Answer 2

Add \n to the regular expression:

(\d{4}-\d{1,2}-\d{1,2}\s\d{1,2}:\d{1,2}:\d{1,2})(\,\d{1,3}\s\[\d{1,3}\]\s)(INFO|ERROR)(.*\n.*)
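
Note that .*\n.* always consumes exactly one extra line, so a single-line record would swallow the header of the record after it, and a third line of the same message would be dropped. A possible variation, sketched here under the assumption that every record begins with a date, uses a negative lookahead so that any number of continuation lines is captured. The names RECORD and sample are illustrative only.

```python
import re

# Header part is the question's pattern; only the last group differs.
RECORD = re.compile(
    r'(\d{4}-\d{1,2}-\d{1,2}\s\d{1,2}:\d{1,2}:\d{1,2})'
    r'(,\d{1,3}\s\[\d{1,3}\]\s)(INFO|ERROR)'
    # First line, then any following lines that do not begin with a date.
    r'(.*(?:\n(?!\d{4}-\d{1,2}-\d{1,2}\s).*)*)'
)

# Hypothetical sample: one single-line record, one three-line record.
sample = (
    "2018-01-09 04:50:25,226 [18] INFO single line message\n"
    "2018-01-09 04:50:29,226 [18] ERROR messages starts here line1\n"
    "    line2 above error continued in next line\n"
    "    line3 still the same error\n"
)

records = RECORD.findall(sample)
for stamp, rest, level, message in records:
    # strip() drops the leading space and any trailing newline of the message.
    print(stamp, level, repr(message.strip()))
```

No re.MULTILINE or $ is needed here, because the lookahead alone decides where a record ends.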

Answer 3

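You can use a lookahead expression and search for the text between a <date1> structure (inclusive) and the next <date2> structure (exclusive). In your case, every log record starts with a <date> structure. You also need to remove the $, because with re.MULTILINE it matches at the end of every line.

Edit

You can do better. Go through the file line by line. As soon as you find a <date> structure, start collecting a new log record, and keep collecting until you see the next <date> structure. Join the lines that belong to one record and run the regex on the result. Then move on to the next record.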
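
The line-by-line idea could be sketched roughly as follows. This is only one possible reading of the description: the names DATE_START, RECORD, iter_records and parse are made up for illustration, and the sketch assumes every record starts with a "YYYY-MM-DD HH:MM:SS" stamp at the beginning of a line.

```python
import re

# Assumption: a record starts with a date-time stamp at the start of a line.
DATE_START = re.compile(r'\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}')
RECORD = re.compile(
    r'(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})(,\d{1,3}\s\[\d{1,3}\])\s'
    r'(INFO|ERROR)\s(.*)',
    re.DOTALL,
)

def iter_records(lines):
    """Group an iterable of lines into one joined string per log record."""
    current = []
    for line in lines:
        is_start = DATE_START.match(line) is not None
        if is_start and current:
            # A new date closes the record collected so far.
            yield ' '.join(current)
            current = []
        # Lines before the first date (e.g. a file header) are skipped.
        if is_start or current:
            current.append(line.strip())
    if current:
        yield ' '.join(current)

def parse(lines):
    """Run the record regex on each joined record."""
    return [m.groups() for m in map(RECORD.match, iter_records(lines)) if m]

# Hypothetical sample with a header line that should be ignored.
sample = [
    "header line to be ignored\n",
    "2018-01-09 04:50:25,226 [18] INFO messages starts here line1\n",
    "    line2 above error continued in next line\n",
    "2018-01-09 04:50:29,226 [18] ERROR messages starts here line1\n",
    "    line2 above error continued in next line\n",
]

for record in parse(sample):
    print(record)
```

Since iter_records accepts any iterable of lines, an open file object can be passed in directly, so the whole log never has to be held in memory at once.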

Answer 4

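This may not be as neat as you had hoped, but nothing stops you from going through the log line by line and accumulating the error message: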

import re

example = '''2018-01-09 04:50:25,226 [18] INFO messages starts here line1
    line2 above error continued in next line
2018-01-09 04:50:29,226 [18] ERROR messages starts here line1
    line2 above error continued in next line
2018-01-09 05:50:29,226 [18] ERROR messages starts here line1
    line2 above error continued in next line  '''

output = []

for line in example.splitlines():
    # Each line is matched on its own, so no regex flags are needed.
    match = re.match(r'(\d{4}-\d{1,2}-\d{1,2}\s\d{1,2}:\d{1,2}:\d{1,2})'
                     r'(\,\d{1,3}\s\[\d{1,3}\]\s)(INFO|ERROR)(.*)',
                     line)
    if match:
        output.append(list(match.groups()))
    # Check that output already exists - in case of headers
    elif output:
        output[-1].append(line)

Returns:

[['2018-01-09 04:50:25', ',226 [18] ', 'INFO', ' messages starts here line1', '    line2 above error continued in next line'], ['2018-01-09 04:50:29', ',226 [18] ', 'ERROR', ' messages starts here line1', '    line2 above error continued in next line'], ['2018-01-09 05:50:29', ',226 [18] ', 'ERROR', ' messages starts here line1', '    line2 above error continued in next line  ']]

Retrieve data until it matches the next regex pattern

4 answers: