Pandas: txt file to DataFrame

Time: 2020-03-16 13:44:06

Tags: python pandas

I have a txt file that contains log entries like the following:

-------------------> 2020-03-04 14:41:11.578 
Unable to process update. Multiple Entries
<------------------- 2020-03-04 14:41:16.000

I am trying to get one column for each line of an entry:

start_time                 event_desc                  end_time
2020-03-04 14:41:11.578    Unable to process update    2020-03-04 14:41:16.000

I tried the following code:

import pandas as pd

# path_to_file is the path to the log file
log_list = []
with open(path_to_file) as file_object:
    for line in file_object:
        log_list.append(line)
df_log = pd.DataFrame(log_list, columns=['log_entries'])

# Each pattern matches only one kind of line, so every other row gets NaN
df_log['start_time'] = df_log['log_entries'].str.extract(r'(?<=^\-{19}\>)\s(?P<start_time>\d{4}\-\d{2}\-\d{2}\s\d{2}\:\d{2}\:\d{2}\.\d{3})')

df_log['event_desc'] = df_log['log_entries'].str.extract(r'(^\w.+)')

df_log['end_datetime'] = df_log['log_entries'].str.extract(r'(?<=^\<\-{19})\s(\d{4}\-\d{2}\-\d{2}\s\d{2}\:\d{2}\:\d{2}\.\d{3})')

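This works, but the event description does not end up on the same row as the start and end times. I considered dropping the NA rows, but I think there may be a more elegant solution?

Thanks!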

1 Answer:

Answer 0: (score: 2)

I would split the file while parsing it, rather than use read_csv, because the file is not in CSV format:

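import io
import re

import pandas as pd

# t holds the log text; here it is the sample entry from the question
t = '\n'.join([
    '-' * 19 + '> 2020-03-04 14:41:11.578',
    'Unable to process update. Multiple Entries',
    '<' + '-' * 19 + ' 2020-03-04 14:41:16.000',
])

start = re.compile(r'(?<=^\-{19}\>)\s(?P<start_time>\d{4}\-\d{2}\-\d{2}\s\d{2}\:\d{2}\:\d{2}\.\d{3})')
end = re.compile(r'(?<=^\<\-{19})\s(\d{4}\-\d{2}\-\d{2}\s\d{2}\:\d{2}\:\d{2}\.\d{3})')
word = re.compile(r'(^\w.+)')
data = []
row = {}  # scratch dict, so lines before the first start marker do not raise

for line in io.StringIO(t):
    match = start.search(line)
    if match:
        # A start marker opens a new row; later lines fill in the same dict
        row = {'start_time': match.group('start_time')}
        data.append(row)
    else:
        match = end.search(line)
        if match:
            row['end_time'] = match.group(1)
        else:
            match = word.search(line)
            if match:
                row['event_desc'] = match.group(1)

df = pd.DataFrame(data, columns=['start_time', 'event_desc', 'end_time'])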
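If you would rather stay close to the str.extract approach from the question, one possible alternative is to label the rows that belong to the same entry and collapse each group into a single row. The sketch below is only an illustration, not part of the answer above: the second log entry is made up, and the names `ts`, `parts` and `entry_id` are my own. It assumes every entry begins with a start marker line.

```python
import pandas as pd

# Sample lines: the first entry is from the question, the second is hypothetical.
# The dashes are built with "-" * 19 so the count is explicit.
lines = [
    "-" * 19 + "> 2020-03-04 14:41:11.578 ",
    "Unable to process update. Multiple Entries",
    "<" + "-" * 19 + " 2020-03-04 14:41:16.000",
    "",
    "-" * 19 + "> 2020-03-04 15:00:00.000",
    "Second hypothetical event",
    "<" + "-" * 19 + " 2020-03-04 15:00:01.250",
]
log = pd.Series(lines, name="log_entries")

# Shared timestamp pattern with a single capture group
ts = r"(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}\.\d{3})"

# One column per line type; rows of the other types are NaN
parts = pd.DataFrame({
    "start_time": log.str.extract(r"^-{19}>\s" + ts, expand=False),
    "event_desc": log.str.extract(r"^(\w.+)", expand=False),
    "end_time": log.str.extract(r"^<-{19}\s" + ts, expand=False),
})

# A running count of start markers gives every row of an entry the same label
entry_id = parts["start_time"].notna().cumsum().rename("entry_id")

# first() keeps the first non-null value of each column within a group
df = parts.groupby(entry_id).first().reset_index(drop=True)
print(df)
```

This avoids dropping NA rows by hand, since the three sparse rows of an entry are merged into one. If the timestamps are needed as datetimes, the two time columns can be passed through pd.to_datetime afterwards.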