Question

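I am using Python logging to generate log files during processing, and I am trying to read these log files into a list/dict, which will then be converted to JSON and loaded into a NoSQL database for processing.

The files are generated in the following format.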

2015-05-22 16:46:46,985 - __main__ - INFO - Starting to Wait for Files
2015-05-22 16:46:56,645 - __main__ - INFO - Starting: Attempt 1 Checking for New Files from gs://folder/folder/
2015-05-22 16:47:46,488 - __main__ - INFO - Success: Downloading the Files from Cloud Storage: Return Code - 0 and FileCount 1
2015-05-22 16:48:48,180 - __main__ - ERROR - Failed: Waiting for files the Files from Cloud Storage: gs://folder/folder/
Traceback (most recent call last):
  File "<ipython-input-16-132cda1c011d>", line 10, in <module>
    if numFilesDownloaded == 0:
NameError: name 'numFilesDownloaded' is not defined
2015-05-22 16:49:17,918 - __main__ - INFO - Starting to Wait for Files
2015-05-22 16:49:32,160 - __main__ - INFO - Starting: Attempt 1 Checking for New Files from gs://folder/folder/
2015-05-22 16:49:39,329 - __main__ - INFO - Success: Downloading the Files from Cloud Storage: Return Code - 0 and FileCount 1
2015-05-22 16:53:30,706 - __main__ - INFO - Starting to Wait for Files

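Note: there is actually a \n break before each new date you see, but I can't seem to represent it here.

Basically, I am trying to read this text file and produce JSON objects that look like this: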

{
    'Date': '2015-05-22 16:46:46,985',
    'Type': 'INFO',
    'Message':'Starting to Wait for Files'
}
...

{
    'Date': '2015-05-22 16:48:48,180',
    'Type': 'ERROR',
    'Message':'Failed: Waiting for files the Files from Cloud Storage:  gs://folder/anotherfolder/ Traceback (most recent call last):
               File "<ipython-input-16-132cda1c011d>", line 10, in <module> if numFilesDownloaded == 0: NameError: name 'numFilesDownloaded' is not defined '
}

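The problem I am having:

I can add each line to a list or dict, etc., but the ERROR message sometimes spans multiple lines, so I end up splitting it incorrectly.

What I tried:

I tried code like the following to split the lines only on valid dates, but I can't seem to get the error messages that span multiple lines. I also tried regular expressions and think that is a possible solution, but I can't seem to find the right regex to use. I have no clue how they work, so I tried a bunch of copy-paste without any success.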

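import itertools as it

listNew = []
with open(filename,'r') as f:
    # Keeps only the lines that start with '2015'; continuation lines are dropped.
    for key,group in it.groupby(f,lambda line: line.startswith('2015')):
        if key:
            for line in group:
                listNew.append(line)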

I tried some crazy regexes, but no luck here either:

logList = re.split(r'(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])', fileData)

Any help is greatly appreciated... thanks.

Edit

Posting the solution below for anyone else struggling with the same thing.

Answer 1

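Using @Joran Beasley's answer, I came up with the following solution, and it seems to work:

Key points:

My log file always follows the same structure, {Date} - {Type} - {Message}, so I used string slicing and splitting to break each line up the way I needed. For example, {Date} is always 23 characters and I only want the first 19.
Using line.startswith("2015") is crazy because the date will eventually change, so I created a new function that uses a regex to match the date format I am expecting. Once again, my log dates follow a specific pattern, so I could be specific.
The file is read into the first function, generateDicts(), which then calls matchDate() to see IF the line being processed matches the {Date} format I am looking for.
A new dict is created every time a valid {Date} format is found, and everything is processed into it until the NEXT valid {Date} is encountered.

Function to split up the log file: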

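def generateDicts(log_fh):
    currentDict = {}
    for line in log_fh:
        if line.startswith(matchDate(line)):
            # A new record starts here, so emit the previous one first.
            if currentDict:
                yield currentDict
            # maxsplit=5 keeps any "-" inside the message intact; index 4 is the level.
            fields = line.split("-", 5)
            currentDict = {"date": line.split("__")[0][:19], "type": fields[4].strip(), "text": fields[-1]}
        elif currentDict:
            # Continuation line (e.g. a traceback): append it to the current record.
            currentDict["text"] += line
    if currentDict:
        yield currentDict

with open("/Users/stevenlevey/Documents/out_folder/out_loyaltybox/log_CardsReport_20150522164636.logs") as f:
    listNew = list(generateDicts(f))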

Function that checks whether the line being processed starts with a {Date} matching the format I am looking for:

import re

def matchDate(line):
    matchThis = ""
    matched = re.match(r'\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d', line)
    if matched:
        # the line starts with a date: return the matched text
        matchThis = matched.group()
    else:
        matchThis = "NONE"
    return matchThis
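
To see the whole approach in one place, here is a minimal self-contained sketch that runs the same idea over an in-memory sample and serialises the result with json.dumps, which is what the question ultimately asks for. The names DATE_RE, SAMPLE and iter_records are illustrative and not part of the code above, the sample is a shortened copy of the log in the question, and the sketch assumes every record line has all four " - "-separated fields.

```python
import io
import json
import re

# Assumed record prefix: "YYYY-MM-DD HH:MM:SS" at the start of the line.
DATE_RE = re.compile(r"\d{4}-\d\d-\d\d \d\d:\d\d:\d\d")

# Shortened copy of the log from the question, used only for illustration.
SAMPLE = """\
2015-05-22 16:47:46,488 - __main__ - INFO - Success: Return Code - 0 and FileCount 1
2015-05-22 16:48:48,180 - __main__ - ERROR - Failed: Waiting for files
Traceback (most recent call last):
  File "<ipython-input-16-132cda1c011d>", line 10, in <module>
NameError: name 'numFilesDownloaded' is not defined
2015-05-22 16:49:17,918 - __main__ - INFO - Starting to Wait for Files
"""


def iter_records(lines):
    """Yield one dict per log record, folding continuation lines into "text"."""
    current = None
    for line in lines:
        if DATE_RE.match(line):
            if current is not None:
                yield current
            # maxsplit=3 keeps any " - " inside the message intact.
            date, _logger, level, text = line.split(" - ", 3)
            current = {"date": date[:19], "type": level, "text": text}
        elif current is not None:
            # Continuation line (e.g. a traceback): append to the current record.
            current["text"] += line
    if current is not None:
        yield current


records = list(iter_records(io.StringIO(SAMPLE)))
print(json.dumps(records, indent=2))
```

Splitting on " - " rather than "-" avoids having to count the dashes inside the date itself.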

Answer 2

Create a generator (I'm on a generator kick today):

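def generateDicts(log_fh):
    currentDict = {}
    for line in log_fh:
        if line.startswith("2015"):  # you might want a better check here
            if currentDict:
                yield currentDict
            # Fields are separated by " - "; maxsplit=3 keeps dashes inside the message.
            date, _name, level, text = line.split(" - ", 3)
            currentDict = {"date": date, "type": level, "text": text}
        elif currentDict:
            # Continuation line (e.g. a traceback): append it to the current record.
            currentDict["text"] += line
    if currentDict:
        yield currentDict

with open("logfile.txt") as f:
    print(list(generateDicts(f)))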

There may be some minor typos... I didn't actually run this.
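
The "better check" mentioned in the comment could, for example, try to parse the timestamp instead of testing for a hard-coded year. This is a minimal sketch, assuming the default logging timestamp layout shown in the question; starts_with_timestamp is a hypothetical helper name, not something from the answer.

```python
from datetime import datetime


def starts_with_timestamp(line):
    """Return True if the first 23 characters parse as a logging timestamp."""
    try:
        # The default logging timestamp "YYYY-MM-DD HH:MM:SS,mmm" is 23 characters long.
        datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")
    except ValueError:
        return False
    return True


print(starts_with_timestamp("2015-05-22 16:46:46,985 - __main__ - INFO - hi"))
print(starts_with_timestamp("Traceback (most recent call last):"))
```

Such a helper could replace line.startswith("2015") in the generator, so the code keeps working when the year changes.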

Answer 3

You can use groups to get the fields you are looking for directly from the regex. You can even name them:

>>> import re
>>> date_re = re.compile(r'(?P<a_year>\d{2,4})-(?P<a_month>\d{2})-(?P<a_day>\d{2}) (?P<an_hour>\d{2}):(?P<a_minute>\d{2}):(?P<a_second>\d{2}[.\d]*)')
>>> found = date_re.match('2016-02-29 12:34:56.789')
>>> if found is not None:
...     print(found.groupdict())
... 
{'a_year': '2016', 'a_month': '02', 'a_day': '29', 'an_hour': '12', 'a_minute': '34', 'a_second': '56.789'}
>>> found.groupdict()['a_month']
'02'

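Then create a date class whose constructor's kwargs match the group names. Use a little ** magic to create an instance of the object directly from the regex groupdict, and you are cooking with gas. In the constructor you can then work out whether 2016 is a leap year and February 29 exists.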
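
One way this could look is sketched below. LogDate is a hypothetical class written for illustration, and the validation uses calendar.monthrange from the standard library, which is just one possible way to do the leap-year check.

```python
import calendar
import re

DATE_RE = re.compile(
    r"(?P<a_year>\d{2,4})-(?P<a_month>\d{2})-(?P<a_day>\d{2}) "
    r"(?P<an_hour>\d{2}):(?P<a_minute>\d{2}):(?P<a_second>\d{2}[.\d]*)"
)


class LogDate:
    """Hypothetical date class whose keyword arguments mirror the group names."""

    def __init__(self, a_year, a_month, a_day, an_hour, a_minute, a_second):
        self.year, self.month, self.day = int(a_year), int(a_month), int(a_day)
        self.hour, self.minute = int(an_hour), int(a_minute)
        self.second = float(a_second)
        # Reject impossible dates such as February 29 in a non-leap year.
        if not 1 <= self.month <= 12:
            raise ValueError(f"invalid month: {self.month}")
        days_in_month = calendar.monthrange(self.year, self.month)[1]
        if not 1 <= self.day <= days_in_month:
            raise ValueError(f"invalid day: {self.day}")


found = DATE_RE.match("2016-02-29 12:34:56.789")
if found is not None:
    # The ** operator unpacks the groupdict into keyword arguments.
    stamp = LogDate(**found.groupdict())
    print(stamp.year, stamp.month, stamp.day, stamp.second)
```

Because the group names and the constructor parameters are the same, adding a field only requires touching the regex and the constructor signature.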

-lrm

Answer 4

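The solution provided by @steven.levey is perfect. One addition I would like to make is to use this regex pattern both to determine whether the line is a valid record and to extract the required values. That way, we don't have to split the line again after using the regex to determine the format.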

pattern = r'(^[0-9\-\s:,]+)\s-\s__main__\s-\s([A-Z]+)\s-\s([\s\S]+)'
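
As a rough illustration of how this pattern could be plugged into the generator approach, here is a minimal sketch. The function name parse_lines and the sample list are illustrative assumptions, and the pattern only matches records whose logger name is __main__, as in the question's log.

```python
import re

# Same pattern as above, compiled once; "__main__" is the logger name in the question's log.
PATTERN = re.compile(r"(^[0-9\-\s:,]+)\s-\s__main__\s-\s([A-Z]+)\s-\s([\s\S]+)")


def parse_lines(lines):
    """Yield dicts; lines that do not match are appended to the previous message."""
    current = None
    for line in lines:
        match = PATTERN.match(line)
        if match:
            if current is not None:
                yield current
            # The three groups are the date, the level and the message.
            date, level, message = match.groups()
            current = {"Date": date, "Type": level, "Message": message}
        elif current is not None:
            current["Message"] += line
    if current is not None:
        yield current


sample = [
    "2015-05-22 16:48:48,180 - __main__ - ERROR - Failed: Waiting for files\n",
    "Traceback (most recent call last):\n",
    "NameError: name 'numFilesDownloaded' is not defined\n",
    "2015-05-22 16:49:17,918 - __main__ - INFO - Starting to Wait for Files\n",
]
for record in parse_lines(sample):
    print(record)
```

A single match both decides whether a line starts a new record and extracts its fields, which is the saving described above.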

Answer 5

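records = []
# Open in text mode so each line is a str and can be split on ' - '.
with open('bla.txt', 'r') as file:
  for line in file:
    parts = line.split(' - ')
    # Only lines with all four fields are kept; traceback lines are skipped.
    if len(parts) >= 4:
      d = dict()
      d['Date'] = parts[0]
      d['Type'] = parts[2]
      # parts[3] stops at the next ' - ', so such messages are cut short.
      d['Message'] = parts[3]
      records.append(d)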

Output:

[{
    'Date': '2015-05-22 16:46:46,985',
    'Message': 'Starting to Wait for Files\n',
    'Type': 'INFO'
}, {
    'Date': '2015-05-22 16:46:56,645',
    'Message': 'Starting: Attempt 1 Checking for New Files from gs://folder/folder/\n',
    'Type': 'INFO'
}, {
    'Date': '2015-05-22 16:47:46,488',
    'Message': 'Success: Downloading the Files from Cloud Storage: Return Code',
    'Type': 'INFO'
}, {
    'Date': '2015-05-22 16:48:48,180',
    'Message': 'Failed: Waiting for files the Files from Cloud Storage: gs://folder/folder/\n',
    'Type': 'ERROR'
}, {
    'Date': '2015-05-22 16:49:17,918',
    'Message': 'Starting to Wait for Files\n',
    'Type': 'INFO'
}, {
    'Date': '2015-05-22 16:49:32,160',
    'Message': 'Starting: Attempt 1 Checking for New Files from gs://folder/folder/\n',
    'Type': 'INFO'
}, {
    'Date': '2015-05-22 16:49:39,329',
    'Message': 'Success: Downloading the Files from Cloud Storage: Return Code',
    'Type': 'INFO'
}, {
    'Date': '2015-05-22 16:53:30,706',
    'Message': 'Starting to Wait for Files',
    'Type': 'INFO'
}]

Answer 6

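I recently had a similar task of parsing log records, but with exception tracebacks included for further analysis. Instead of banging my head against home-made regexes, I used two great libraries: parse for parsing the records (it is actually a very cool library, effectively the inverse of the stdlib's string.format) and boltons for parsing the tracebacks. Here is sample code I extracted from my implementation, adapted to the log in question: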

import datetime
import logging
import os
from pathlib import Path
from boltons.tbutils import ParsedException
from parse import parse, with_pattern


LOGGING_DEFAULT_DATEFMT = f"{logging.Formatter.default_time_format},%f"


# TODO better pattern
@with_pattern(r"\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d,\d\d\d")
def parse_logging_time(raw):
    return datetime.datetime.strptime(raw, LOGGING_DEFAULT_DATEFMT)


def from_log(file: os.PathLike, fmt: str):
    chunk = ""
    custom_parsers = {"asctime": parse_logging_time}

    with Path(file).open() as fp:
        for line in fp:
            parsed = parse(fmt, line, custom_parsers)
            if parsed is not None:
                yield parsed
            else:  # try parsing the stacktrace
                chunk += line
                try:
                    yield ParsedException.from_string(chunk)
                    chunk = ""
                except (IndexError, ValueError):
                    pass


if __name__ == "__main__":
    for parsed_record in from_log(
        file="so.log",
        fmt="{asctime:asctime} - {module} - {levelname} - {message}"
    ):
        print(parsed_record)

When executed, this produces:

<Result () {'asctime': datetime.datetime(2015, 5, 22, 16, 46, 46, 985000), 'module': '__main__', 'levelname': 'INFO', 'message': 'Starting to Wait for Files\n'}>
<Result () {'asctime': datetime.datetime(2015, 5, 22, 16, 46, 56, 645000), 'module': '__main__', 'levelname': 'INFO', 'message': 'Starting: Attempt 1 Checking for New Files from gs://folder/folder/\n'}>
<Result () {'asctime': datetime.datetime(2015, 5, 22, 16, 47, 46, 488000), 'module': '__main__', 'levelname': 'INFO', 'message': 'Success: Downloading the Files from Cloud Storage: Return Code - 0 and FileCount 1\n'}>
<Result () {'asctime': datetime.datetime(2015, 5, 22, 16, 48, 48, 180000), 'module': '__main__', 'levelname': 'ERROR', 'message': 'Failed: Waiting for files the Files from Cloud Storage: gs://folder/folder/\n'}>
ParsedException('NameError', "name 'numFilesDownloaded' is not defined", frames=[{'filepath': '<ipython-input-16-132cda1c011d>', 'lineno': '10', 'funcname': '<module>', 'source_line': 'if numFilesDownloaded == 0:'}])
<Result () {'asctime': datetime.datetime(2015, 5, 22, 16, 49, 17, 918000), 'module': '__main__', 'levelname': 'INFO', 'message': 'Starting to Wait for Files\n'}>
<Result () {'asctime': datetime.datetime(2015, 5, 22, 16, 49, 32, 160000), 'module': '__main__', 'levelname': 'INFO', 'message': 'Starting: Attempt 1 Checking for New Files from gs://folder/folder/\n'}>
<Result () {'asctime': datetime.datetime(2015, 5, 22, 16, 49, 39, 329000), 'module': '__main__', 'levelname': 'INFO', 'message': 'Success: Downloading the Files from Cloud Storage: Return Code - 0 and FileCount 1\n'}>
<Result () {'asctime': datetime.datetime(2015, 5, 22, 16, 53, 30, 706000), 'module': '__main__', 'levelname': 'INFO', 'message': 'Starting to Wait for Files\n'}>

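Notes

If you specify your logging format with the { style, chances are high that you can simply pass the logging format string to parse and it will just work. In this example, I had to improvise and use a custom timestamp parser to match the question's requirements; if the timestamp were in a common format, e.g. ISO 8601, one could just use fmt="{asctime:ti} - {module} - {levelname} - {message}" and drop parse_logging_time and custom_parsers from the sample code. parse supports several common timestamp formats out of the box; see the parse documentation for the list.

parse.Result is a dict-like object, so parsed_record["message"] returns the parsed message, etc.

Note the printed ParsedException object: this is the exception parsed from the traceback.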
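
For reference, a {-style format is selected on the logging side with the standard library's style argument. The sketch below is a minimal illustration only: LOG_FORMAT and the logger name "demo" are assumptions, and the layout merely mirrors the log in the question (the __main__ column there corresponds to the logger name).

```python
import io
import logging

# Illustrative format string in "{" style, mirroring the layout of the log in the question.
LOG_FORMAT = "{asctime} - {name} - {levelname} - {message}"

# Write to an in-memory stream so the formatted line can be inspected.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter(LOG_FORMAT, style="{"))

logger = logging.getLogger("demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the demo output out of the root logger

logger.info("Starting to Wait for Files")
print(stream.getvalue(), end="")
```

Keeping the format string in one constant means the writer and the reader of the log can share it.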
