I have a text file that looks like this:
Hypothesis:
drink
Reference:
Drake
WER:
100.0
Time:
2.416645050048828
"---------------------------"
Hypothesis:
Ed Sheeran
Reference:
Ed Sheeran
WER:
0.0
Time:
2.854194164276123
When I try to read it into a pandas.DataFrame with ["Hypothesis", "Reference", "WER", "Time"] as columns, it returns an error.
I tried:
txt= pd.read_csv("/home/kolagaza/Desktop/IAIS_en.txt", sep="---------------------------", header = None, engine='python')
data.columns = ["Hypothesis", "Reference","WER","Time"]
Answer 0 (score: 0)
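I don't think you can read this text file directly into a pandas DataFrame without some preprocessing first. One approach is to convert the input into the pandas records format, i.e. a list of dictionaries like this: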
[{'Hypothesis': 'drink', 'Reference': 'Drake', 'WER': '100.0', 'Time': '2.416645050048828'},
{'Hypothesis': 'Ed Sheeran','Reference': 'Ed Sheeran', 'WER': '0.0', 'Time': '2.854194164276123'}]
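To illustrate the records format, here is a minimal sketch that builds a DataFrame directly from the two dictionaries above. The name `df_example` and the explicit `columns` argument are my additions, used only to fix the column order.

```python
import pandas as pd

# the two records from the sample file, written out by hand
records = [
    {'Hypothesis': 'drink', 'Reference': 'Drake', 'WER': '100.0', 'Time': '2.416645050048828'},
    {'Hypothesis': 'Ed Sheeran', 'Reference': 'Ed Sheeran', 'WER': '0.0', 'Time': '2.854194164276123'},
]

# each dictionary becomes one row; the keys become the column names
df_example = pd.DataFrame(records, columns=["Hypothesis", "Reference", "WER", "Time"])
print(df_example)
```

So the remaining work is only to turn the file contents into such a list.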
I tried the following code and it worked for me (I copied your sample text file):
import pandas as pd

records = []
with open("/home/kolagaza/Desktop/IAIS_en.txt", "r") as fh:
    # remove blank lines and surrounding whitespace
    lines = [line.strip() for line in fh if line.strip()]

# join all lines with commas, glue each "Key:" line to its value,
# then split on the separator so each element represents one row of the final dataframe
lines = ",".join(lines).replace(':,', ':').split('"---------------------------"')

# now convert each element into a record
for line in lines:
    record = {}
    for keyval in line.split(','):
        if len(keyval) > 0:
            # split only on the first colon so values may contain colons
            key, val = keyval.split(':', 1)
            record[key] = val
    # skip empty chunks, e.g. when the file ends with a separator line
    if record:
        records.append(record)

df = pd.DataFrame(records)
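Note that the comma-joining approach above breaks if a hypothesis or reference itself contains a comma, and it leaves WER and Time as strings. As an alternative sketch (not the method above), the file can be walked line by line, treating a line such as `Hypothesis:` as a key and the next line as its value. The helper name `parse_blocks`, the `COLUMNS` constant, and the inline sample text are assumptions for illustration. With a real file you would pass the open file handle instead of the `io.StringIO` object.

```python
import io
import pandas as pd

COLUMNS = ["Hypothesis", "Reference", "WER", "Time"]

def parse_blocks(fh):
    """Parse 'Key:' / value line pairs into a list of dicts (hypothetical helper)."""
    records, record, key = [], {}, None
    for raw in fh:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.strip('"').startswith("-----"):
            # separator line: close the current record
            if record:
                records.append(record)
            record, key = {}, None
        elif key is None and line.endswith(":") and line[:-1] in COLUMNS:
            key = line[:-1]  # header line such as "WER:"
        elif key is not None:
            record[key] = line  # value line following a header
            key = None
    if record:
        records.append(record)  # the last block has no trailing separator
    return records

# sample text copied from the question, used here instead of the real file
sample = io.StringIO(
    'Hypothesis:\ndrink\nReference:\nDrake\nWER:\n100.0\nTime:\n2.416645050048828\n'
    '"---------------------------"\n'
    'Hypothesis:\nEd Sheeran\nReference:\nEd Sheeran\nWER:\n0.0\nTime:\n2.854194164276123\n'
)
df_alt = pd.DataFrame(parse_blocks(sample), columns=COLUMNS)

# the parsed values are strings; convert the numeric columns
df_alt[["WER", "Time"]] = df_alt[["WER", "Time"]].apply(pd.to_numeric)
print(df_alt.dtypes)
```

Because each value is read as a whole line, commas and colons inside a hypothesis or reference do not affect the parsing.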