Reading a text file with an unusual delimiter into a pandas DataFrame

Posted: 2020-02-12 13:42:44

Tags: python pandas text

I have a text file that looks like this:

Hypothesis:

drink

Reference:

Drake
WER:

100.0

Time:

2.416645050048828

"---------------------------"

Hypothesis:

Ed Sheeran

Reference:

Ed Sheeran

WER:

0.0

Time:

2.854194164276123

When I try to read it into a pandas.DataFrame with ["Hypothesis", "Reference", "WER", "Time"] as the columns, I get an error.

This is what I tried:

txt= pd.read_csv("/home/kolagaza/Desktop/IAIS_en.txt", sep="---------------------------", header = None, engine='python')

data.columns = ["Hypothesis", "Reference","WER","Time"]

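1 answer:

Answer 0 (score: 0)

I don't think you can read this text file directly into a pandas DataFrame without some preprocessing first. One approach is to convert the input into the pandas records format, i.e. a list of dicts like this: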

[{'Hypothesis': 'drink', 'Reference': 'Drake', 'WER': '100.0', 'Time': '2.416645050048828'},
 {'Hypothesis': 'Ed Sheeran', 'Reference': 'Ed Sheeran', 'WER': '0.0', 'Time': '2.854194164276123'}]
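
To illustrate the records format: `pd.DataFrame` accepts a list of dicts and uses the dict keys as column names. Here is a minimal sketch built from the two records above. The `pd.to_numeric` step is my own addition, on the assumption that you want WER and Time as numbers, since the parsed values are all strings.

```python
import pandas as pd

records = [
    {'Hypothesis': 'drink', 'Reference': 'Drake', 'WER': '100.0', 'Time': '2.416645050048828'},
    {'Hypothesis': 'Ed Sheeran', 'Reference': 'Ed Sheeran', 'WER': '0.0', 'Time': '2.854194164276123'},
]

# Each dict becomes one row; the dict keys become the column names.
df = pd.DataFrame(records)

# The values are still strings at this point, so convert the numeric columns.
df[['WER', 'Time']] = df[['WER', 'Time']].apply(pd.to_numeric)
```

After this, `df` has two rows and the four columns you asked for, with WER and Time stored as floats.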

I tried the following code and it worked for me (I copied your sample text file):

import pandas as pd

records = []
with open("/home/kolagaza/Desktop/IAIS_en.txt", "r") as fh:
    # strip whitespace and drop blank (or whitespace-only) lines
    lines = [line.strip() for line in fh if line.strip()]
    # join everything, glue each value to its "key:" label, then split on the
    # separator so that each element represents one row of the final dataframe
    lines = ",".join(lines).replace(':,', ':').split('"---------------------------"')
    # now convert each row into a record
    for line in lines:
        record = {}
        for keyval in line.split(','):
            if len(keyval) > 0:
                # split on the first colon only, in case a value contains ':'
                key, val = keyval.split(':', 1)
                record[key] = val
        # skip empty records, e.g. when the file ends with a separator
        if record:
            records.append(record)

df = pd.DataFrame(records)
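
One caveat: the join-and-split approach above breaks if a hypothesis or reference contains a comma. As a hedged alternative, here is a sketch that walks the file line by line instead. The names `parse_records`, `FIELDS` and `SEPARATOR` are my own, and the second sample record is made up to exercise commas and colons; it is not from your file.

```python
FIELDS = ("Hypothesis", "Reference", "WER", "Time")
SEPARATOR = '"---------------------------"'  # copied from the sample file


def parse_records(lines):
    """Turn 'Label:' / value lines into a list of dicts, one per block."""
    records, record, key = [], {}, None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line == SEPARATOR:
            # the separator closes the current record
            if record:
                records.append(record)
            record, key = {}, None
        elif line.endswith(":") and line[:-1] in FIELDS:
            # a known label: the next non-blank line is its value
            key = line[:-1]
        elif key is not None:
            # the value is stored verbatim, commas and colons included
            record[key] = line
            key = None
    if record:
        records.append(record)
    return records


sample_lines = [
    "Hypothesis:", "", "drink", "", "Reference:", "", "Drake",
    "WER:", "", "100.0", "", "Time:", "", "2.416645050048828", "",
    SEPARATOR, "",
    # hypothetical record, only here to test commas and colons in a value
    "Hypothesis:", "", "play: rock, pop", "", "Reference:", "", "play rock pop", "",
    "WER:", "", "50.0", "", "Time:", "", "1.5",
]

records = parse_records(sample_lines)
```

An open file handle can be passed in place of `sample_lines`, since it also yields one line at a time. The resulting list goes into `pd.DataFrame(records)` as before. Because labels are checked before values, a block with an empty hypothesis simply lacks that key, which should show up as NaN in the DataFrame rather than shifting the other fields.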