I have a txt file that contains log entries like the following:
-------------------> 2020-03-04 14:41:11.578
Unable to process update. Multiple Entries
<------------------- 2020-03-04 14:41:16.000
I am trying to get one row per log entry, with one column for each field:
start_time event_desc end_time
2020-03-04 14:41.00 Unable to process update 2020-03-04 14:41:16.000
I tried the following code:
import pandas as pd

# Read the file line by line into a single-column DataFrame
log_list = []
with open(path_to_file) as file_object:
    for line in file_object:
        log_list.append(line)
df_log = pd.DataFrame(log_list, columns=['log_entries'])
# Each extract only matches one kind of line, so the other rows become NaN
df_log['start_time'] = df_log['log_entries'].str.extract(r'(?<=^\-{19}\>)\s(?P<start_time>\d{4}\-\d{2}\-\d{2}\s\d{2}\:\d{2}\:\d{2}\.\d{3})')
df_log['event_desc'] = df_log['log_entries'].str.extract(r'(^\w.+)')
df_log['end_datetime'] = df_log['log_entries'].str.extract(r'(?<=^\<\-{19})\s(\d{4}\-\d{2}\-\d{2}\s\d{2}\:\d{2}\:\d{2}\.\d{3})')
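This works, but the event description is not aligned with the start and end times: each value ends up on its own row, with NaN in the other columns. I considered dropping the NA rows, but I suspect there is a more elegant solution?
Thanks!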
Answer 0 (score: 2)
I would split the file into records while parsing it, rather than using read_csv, because the file is not in CSV format:
import io
import re

import pandas as pd

# `t` is assumed to hold the full text of the log file as a single string
start = re.compile(r'(?<=^\-{19}\>)\s(?P<start_time>\d{4}\-\d{2}\-\d{2}\s\d{2}\:\d{2}\:\d{2}\.\d{3})')
end = re.compile(r'(?<=^\<\-{19})\s(\d{4}\-\d{2}\-\d{2}\s\d{2}\:\d{2}\:\d{2}\.\d{3})')
word = re.compile(r'(^\w.+)')
data = []
for line in io.StringIO(t):
    match = start.search(line)
    if match:
        # A start marker opens a new row
        row = {'start_time': match.group('start_time')}
        data.append(row)
    else:
        match = end.search(line)
        if match:
            # An end marker completes the current row
            row['end_time'] = match.group(1)
        else:
            match = word.search(line)
            if match:
                # Any other line starting with a word character is the description
                row['event_desc'] = match.group(1)
df = pd.DataFrame(data, columns=['start_time', 'event_desc', 'end_time'])
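As a possible variation on the same idea, the whole file can be matched with one multi-line regular expression, so that each match already holds a complete record. This is only a sketch: `log_text`, `TIMESTAMP`, `entry_pattern` and `parse_log` are hypothetical names, the sample text is copied from the question, and it assumes every entry has a start marker, a description, and an end marker, in that order.

```python
import re

# Hypothetical sample copied from the question; in practice the text
# could come from open(path_to_file).read()
log_text = """\
-------------------> 2020-03-04 14:41:11.578
Unable to process update. Multiple Entries
<------------------- 2020-03-04 14:41:16.000
"""

# Timestamp layout assumed from the sample: YYYY-MM-DD HH:MM:SS.mmm
TIMESTAMP = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}"

# One pattern per record: start marker line, description, end marker line.
# "-+" accepts any number of dashes instead of exactly 19.
entry_pattern = re.compile(
    rf"^-+> (?P<start_time>{TIMESTAMP})\n"
    rf"(?P<event_desc>.*?)\n"
    rf"^<-+ (?P<end_time>{TIMESTAMP})",
    flags=re.MULTILINE | re.DOTALL,
)


def parse_log(text):
    """Return one dict per complete log entry found in text."""
    return [match.groupdict() for match in entry_pattern.finditer(text)]


rows = parse_log(log_text)
```

The resulting list of dicts can be passed to pd.DataFrame(rows, columns=['start_time', 'event_desc', 'end_time']) exactly as in the loop above. Because of re.DOTALL and the lazy `.*?`, a description that spans several lines is kept as a single value, whereas the line-by-line loop keeps only the last matching line.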