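I'm using Python logging to generate log files while processing, and I'm trying to read those log files into a list/dict, which will then be converted to JSON and loaded into a NoSQL database for processing.
The files are generated in the following format.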
2015-05-22 16:46:46,985 - __main__ - INFO - Starting to Wait for Files
2015-05-22 16:46:56,645 - __main__ - INFO - Starting: Attempt 1 Checking for New Files from gs://folder/folder/
2015-05-22 16:47:46,488 - __main__ - INFO - Success: Downloading the Files from Cloud Storage: Return Code - 0 and FileCount 1
2015-05-22 16:48:48,180 - __main__ - ERROR - Failed: Waiting for files the Files from Cloud Storage: gs://folder/folder/
Traceback (most recent call last):
  File "<ipython-input-16-132cda1c011d>", line 10, in <module>
    if numFilesDownloaded == 0:
NameError: name 'numFilesDownloaded' is not defined
2015-05-22 16:49:17,918 - __main__ - INFO - Starting to Wait for Files
2015-05-22 16:49:32,160 - __main__ - INFO - Starting: Attempt 1 Checking for New Files from gs://folder/folder/
2015-05-22 16:49:39,329 - __main__ - INFO - Success: Downloading the Files from Cloud Storage: Return Code - 0 and FileCount 1
2015-05-22 16:53:30,706 - __main__ - INFO - Starting to Wait for Files
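Note: there is actually a \n line break before each new date you see, but I can't seem to represent that here.
Basically, I'm trying to read this text file and produce JSON objects that look like this: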
{
'Date': '2015-05-22 16:46:46,985',
'Type': 'INFO',
'Message':'Starting to Wait for Files'
}
...
{
'Date': '2015-05-22 16:48:48,180',
'Type': 'ERROR',
'Message':'Failed: Waiting for files the Files from Cloud Storage: gs://folder/anotherfolder/ Traceback (most recent call last):
File "<ipython-input-16-132cda1c011d>", line 10, in <module> if numFilesDownloaded == 0: NameError: name 'numFilesDownloaded' is not defined '
}
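The problem I'm running into:
I can add each line to a list or dict, etc., but the ERROR message sometimes spans multiple lines, so I end up splitting it incorrectly.
What I've tried:
I tried code like the following to split lines only on valid dates, but I can't seem to capture the error messages that span multiple lines. I also tried regular expressions and think they are a possible solution, but I can't find the right regex to use. I have no clue how they work, so I tried a bunch of copy-pasting without any success.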
with open(filename, 'r') as f:
    for key, group in it.groupby(f, lambda line: line.startswith('2015')):
        if key:
            for line in group:
                listNew.append(line)
I also tried some crazy regex, but had no luck here either:
logList = re.split(r'(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])', fileData)
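Any help would be much appreciated. Thanks.
Edit
Posting the solution below for anyone else struggling with the same thing.
Answer 0 (score: 7)
Using @Joran Beasley's answer I came up with the following solution, and it seems to work:
Key points: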
import re

def matchDate(line):
    # Return the leading "YYYY-MM-DD HH:MM:SS" timestamp, or None if the line does not start with one.
    matched = re.match(r'\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d', line)
    return matched.group() if matched else None

def generateDicts(log_fh):
    currentDict = {}
    for line in log_fh:
        date = matchDate(line)
        if date:
            # A timestamp starts a new record, so emit the previous one first.
            if currentDict:
                yield currentDict
            # Fields are separated by " - "; maxsplit=3 keeps any " - " inside the message intact.
            _, _, level, text = line.split(" - ", 3)
            currentDict = {"date": date, "type": level, "text": text}
        elif currentDict:
            # No timestamp: a continuation line (e.g. a traceback), so append it to the current message.
            currentDict["text"] += line
    if currentDict:
        yield currentDict

with open("/Users/stevenlevey/Documents/out_folder/out_loyaltybox/log_CardsReport_20150522164636.logs") as f:
    listNew = list(generateDicts(f))
Answer 1 (score: 3)
Create a generator (I'm on a generator bender today):
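def generateDicts(log_fh):
    currentDict = {}
    for line in log_fh:
        if line.startswith("2015"):  # you might want a better check here
            if currentDict:
                yield currentDict
            # Split on " - " rather than "-", because the date itself contains hyphens.
            date, _, level, text = line.split(" - ", 3)
            currentDict = {"date": date, "type": level, "text": text}
        elif currentDict:
            # Continuation line (e.g. a traceback): append it to the current message.
            currentDict["text"] += line
    if currentDict:
        yield currentDict

with open("logfile.txt") as f:
    print(list(generateDicts(f)))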
There may be a few small typos in there... I haven't actually run this.
Answer 2 (score: 2)
You can use groups to get the fields you are looking for directly from the regex. You can even name them:
>>> import re
>>> date_re = re.compile(r'(?P<a_year>\d{2,4})-(?P<a_month>\d{2})-(?P<a_day>\d{2}) (?P<an_hour>\d{2}):(?P<a_minute>\d{2}):(?P<a_second>\d{2}[.\d]*)')
>>> found = date_re.match('2016-02-29 12:34:56.789')
>>> if found is not None:
...     print(found.groupdict())
...
{'a_year': '2016', 'a_month': '02', 'a_day': '29', 'an_hour': '12', 'a_minute': '34', 'a_second': '56.789'}
>>> found.groupdict()['a_month']
'02'
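Then create a date class whose constructor kwargs match the group names. Use a little ** magic to create an instance of the object directly from the regex groupdict, and you're cooking with gas. In the constructor you can then work out whether 2016 is a leap year and whether Feb 29 exists.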
-lrm
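To make the suggestion above concrete, here is a minimal sketch of such a class. The name LogDate and its attributes are hypothetical, invented for illustration; only the regex and the group names come from the answer. It assumes the standard-library calendar module is an acceptable way to check whether the day exists in the given month.

```python
import calendar
import re

date_re = re.compile(
    r'(?P<a_year>\d{2,4})-(?P<a_month>\d{2})-(?P<a_day>\d{2}) '
    r'(?P<an_hour>\d{2}):(?P<a_minute>\d{2}):(?P<a_second>\d{2}[.\d]*)'
)


class LogDate:
    """Hypothetical date class; its keyword arguments mirror the regex group names."""

    def __init__(self, a_year, a_month, a_day, an_hour, a_minute, a_second):
        self.year = int(a_year)
        self.month = int(a_month)
        self.day = int(a_day)
        self.hour = int(an_hour)
        self.minute = int(a_minute)
        self.second = float(a_second)
        # Reject impossible dates; monthrange accounts for leap years (Feb 29).
        if not 1 <= self.month <= 12:
            raise ValueError("month out of range: %d" % self.month)
        days_in_month = calendar.monthrange(self.year, self.month)[1]
        if not 1 <= self.day <= days_in_month:
            raise ValueError("day out of range: %d" % self.day)


found = date_re.match('2016-02-29 12:34:56.789')
if found is not None:
    # ** unpacks the groupdict into the matching keyword arguments.
    stamp = LogDate(**found.groupdict())
    print(stamp.year, stamp.month, stamp.day, stamp.second)
```

Because the group names and the constructor parameters are the same, renaming a group in the regex without renaming the parameter raises a TypeError straight away, which makes a mismatch easy to spot.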
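Answer 3 (score: 0)
The solution provided by @steven.levey is perfect. One addition I'd like to make is to use this regex pattern both to determine whether the line is well-formed and to extract the required values. That way we don't have to split the line again after using a regex to check the format.
pattern = r'(^[0-9\-\s\:\,]+)\s-\s__main__\s-\s([A-Z]+)\s-\s([\s\S]+)'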
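As a rough sketch of how this pattern could be plugged into the generator approach from the earlier answers: the function name generate_dicts and the sample lines are made up for illustration, and the pattern is the one above, compiled once. Lines that do not match are treated as continuations of the previous message.

```python
import re

# Pattern from this answer, compiled once as a raw string.
pattern = re.compile(r'(^[0-9\-\s\:\,]+)\s-\s__main__\s-\s([A-Z]+)\s-\s([\s\S]+)')


def generate_dicts(lines):
    """Yield one dict per log record; unmatched lines extend the previous message."""
    current = None
    for line in lines:
        matched = pattern.match(line)
        if matched:
            # A well-formed line starts a new record, so emit the previous one.
            if current is not None:
                yield current
            date, level, text = matched.groups()
            current = {"date": date, "type": level, "text": text}
        elif current is not None:
            # Traceback or other continuation line.
            current["text"] += line
    if current is not None:
        yield current


# Illustrative input modelled on the log in the question.
sample = [
    "2015-05-22 16:48:48,180 - __main__ - ERROR - Failed: Waiting for files\n",
    "Traceback (most recent call last):\n",
    "NameError: name 'numFilesDownloaded' is not defined\n",
    "2015-05-22 16:49:17,918 - __main__ - INFO - Starting to Wait for Files\n",
]
records = list(generate_dicts(sample))
print(records)
```

Note that the pattern hard-codes __main__, so records written by any other logger name would be treated as continuation lines.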
Answer 4 (score: 0)
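records = []  # renamed from `list` to avoid shadowing the built-in
with open('bla.txt', 'r') as log_file:  # text mode, so str.split works on Python 3
    for line in log_file:
        parts = line.split(' - ')
        # Only lines with at least four fields are kept, so traceback lines are skipped.
        if len(parts) >= 4:
            # parts[3] stops at the next ' - ', so a message containing ' - ' is cut short.
            records.append({'Date': parts[0], 'Type': parts[2], 'Message': parts[3]})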
Output:
[{
'Date': '2015-05-22 16:46:46,985',
'Message': 'Starting to Wait for Files\n',
'Type': 'INFO'
}, {
'Date': '2015-05-22 16:46:56,645',
'Message': 'Starting: Attempt 1 Checking for New Files from gs://folder/folder/\n',
'Type': 'INFO'
}, {
'Date': '2015-05-22 16:47:46,488',
'Message': 'Success: Downloading the Files from Cloud Storage: Return Code',
'Type': 'INFO'
}, {
'Date': '2015-05-22 16:48:48,180',
'Message': 'Failed: Waiting for files the Files from Cloud Storage: gs://folder/folder/\n',
'Type': 'ERROR'
}, {
'Date': '2015-05-22 16:49:17,918',
'Message': 'Starting to Wait for Files\n',
'Type': 'INFO'
}, {
'Date': '2015-05-22 16:49:32,160',
'Message': 'Starting: Attempt 1 Checking for New Files from gs://folder/folder/\n',
'Type': 'INFO'
}, {
'Date': '2015-05-22 16:49:39,329',
'Message': 'Success: Downloading the Files from Cloud Storage: Return Code',
'Type': 'INFO'
}, {
'Date': '2015-05-22 16:53:30,706',
'Message': 'Starting to Wait for Files',
'Type': 'INFO'
}]
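The output above shows two gaps relative to what the question asks for: messages containing " - " are cut off at "Return Code", and the traceback lines are dropped. A minimal sketch of one way to close both gaps and then serialize the result to JSON follows. The function name parse_log and the sample lines are made up for illustration, and it assumes that no continuation line itself contains three " - " separators.

```python
import json


def parse_log(lines):
    """Collect records as dicts; lines that lack the three separators extend the last message."""
    records = []
    for line in lines:
        # maxsplit=3 keeps any further " - " inside the message.
        parts = line.rstrip("\n").split(" - ", 3)
        if len(parts) == 4:
            date, _module, level, message = parts
            records.append({"Date": date, "Type": level, "Message": message})
        elif records:
            # Continuation line (e.g. traceback): join it onto the previous message.
            records[-1]["Message"] += " " + line.strip()
    return records


# Illustrative input modelled on the log in the question.
sample = [
    "2015-05-22 16:47:46,488 - __main__ - INFO - Success: Return Code - 0 and FileCount 1\n",
    "2015-05-22 16:48:48,180 - __main__ - ERROR - Failed: Waiting for files\n",
    "Traceback (most recent call last):\n",
    "NameError: name 'numFilesDownloaded' is not defined\n",
]
records = parse_log(sample)
as_json = json.dumps(records, indent=2)
print(as_json)
```

Joining continuation lines with a single space mirrors the flattened 'Message' shown in the question; keeping the newlines instead would only require appending the line unchanged.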
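Answer 5 (score: 0)
I recently had a similar task of parsing log records, but also the exception tracebacks, for further analysis. Instead of banging my head against home-brewed regexes, I used two great libraries: parse for parsing the records (a really cool library, effectively the inverse of the stdlib's string.format) and boltons for parsing the tracebacks. Here is sample code extracted from my implementation and adapted to the log in question: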
import datetime
import logging
import os
from pathlib import Path

from boltons.tbutils import ParsedException
from parse import parse, with_pattern

LOGGING_DEFAULT_DATEFMT = f"{logging.Formatter.default_time_format},%f"

# TODO better pattern
@with_pattern(r"\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d,\d\d\d")
def parse_logging_time(raw):
    return datetime.datetime.strptime(raw, LOGGING_DEFAULT_DATEFMT)

def from_log(file: os.PathLike, fmt: str):
    chunk = ""
    custom_parsers = {"asctime": parse_logging_time}
    with Path(file).open() as fp:
        for line in fp:
            parsed = parse(fmt, line, custom_parsers)
            if parsed is not None:
                yield parsed
            else:  # try parsing the stacktrace
                chunk += line
                try:
                    yield ParsedException.from_string(chunk)
                    chunk = ""
                except (IndexError, ValueError):
                    pass

if __name__ == "__main__":
    for parsed_record in from_log(
        file="so.log",
        fmt="{asctime:asctime} - {module} - {levelname} - {message}"
    ):
        print(parsed_record)
When executed, this yields
<Result () {'asctime': datetime.datetime(2015, 5, 22, 16, 46, 46, 985000), 'module': '__main__', 'levelname': 'INFO', 'message': 'Starting to Wait for Files\n'}>
<Result () {'asctime': datetime.datetime(2015, 5, 22, 16, 46, 56, 645000), 'module': '__main__', 'levelname': 'INFO', 'message': 'Starting: Attempt 1 Checking for New Files from gs://folder/folder/\n'}>
<Result () {'asctime': datetime.datetime(2015, 5, 22, 16, 47, 46, 488000), 'module': '__main__', 'levelname': 'INFO', 'message': 'Success: Downloading the Files from Cloud Storage: Return Code - 0 and FileCount 1\n'}>
<Result () {'asctime': datetime.datetime(2015, 5, 22, 16, 48, 48, 180000), 'module': '__main__', 'levelname': 'ERROR', 'message': 'Failed: Waiting for files the Files from Cloud Storage: gs://folder/folder/\n'}>
ParsedException('NameError', "name 'numFilesDownloaded' is not defined", frames=[{'filepath': '<ipython-input-16-132cda1c011d>', 'lineno': '10', 'funcname': '<module>', 'source_line': 'if numFilesDownloaded == 0:'}])
<Result () {'asctime': datetime.datetime(2015, 5, 22, 16, 49, 17, 918000), 'module': '__main__', 'levelname': 'INFO', 'message': 'Starting to Wait for Files\n'}>
<Result () {'asctime': datetime.datetime(2015, 5, 22, 16, 49, 32, 160000), 'module': '__main__', 'levelname': 'INFO', 'message': 'Starting: Attempt 1 Checking for New Files from gs://folder/folder/\n'}>
<Result () {'asctime': datetime.datetime(2015, 5, 22, 16, 49, 39, 329000), 'module': '__main__', 'levelname': 'INFO', 'message': 'Success: Downloading the Files from Cloud Storage: Return Code - 0 and FileCount 1\n'}>
<Result () {'asctime': datetime.datetime(2015, 5, 22, 16, 53, 30, 706000), 'module': '__main__', 'levelname': 'INFO', 'message': 'Starting to Wait for Files\n'}>
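If you specify the log format using the { style, chances are you can simply pass the log format string to parse and it will just work. In this example I had to improvise and use a custom timestamp parser to match the question's requirements. If the timestamp is in a common format, e.g. ISO 8601, you can just use fmt="{asctime:ti} - {module} - {levelname} - {message}" and drop parse_logging_time and custom_parsers from the sample code. parse supports several common timestamp formats out of the box; see its documentation.
parse.Result is a dict-like object, so parsed_record["message"] returns the parsed message, and so on.
Note the printed ParsedException object: this is the exception parsed from the traceback.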