I am trying to parse this text file into a Pandas DataFrame. The text file has the following specific format:
Name: Tom
Gender: Male
Books:
The problem of Pain
The reason for God: belief in an age of skepticism
The code I have so far to import the data is:
import pandas as pd
df = pd.read_table(filename, sep=":|\n", engine='python', index_col=0)
print(df)
The output I get is:
Name Tom
Gender Male
Books NaN
The problem of Pain NaN
The reason for God belief in an age of skepticism
How should I change the code so that the output looks like this? (edited output)
Name Gender Books
Tom Male The problem of Pain, The reason for God: belief in an age of skepticism
Thanks for your help!
Answer 0 (score: 1)
There are two ways to do this. The first is to use enumerate() together with an if statement. The code below reads a text file named test.txt.
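import pandas as pd

d = {}
value_list = []
with open('test.txt', 'r') as f:
    for index, text in enumerate(f):
        if index < 2:
            # The first two lines are "key: value" pairs (Name, Gender)
            key, val = text.split(':', 1)
            d[key] = val.strip()
        elif index == 2:
            # The third line is the "Books:" header
            value = text.split(':')[0]
        else:
            # Every remaining line is a book title
            value_list.append(text.rstrip('\n'))
# Wrap the list in another list so all titles land in a single cell
d[value] = [value_list]
df = pd.DataFrame(d)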
Alternatively, you can use readlines(), slice out each line to populate a dictionary, and then create a DataFrame from it.
import pandas as pd

with open('test.txt', 'r') as text_file:
    lines = text_file.readlines()

d = {}
# Lines 0 and 1 are "key: value" pairs (Name, Gender)
d[lines[0].split(':')[0]] = lines[0].split(':', 1)[1].strip()
d[lines[1].split(':')[0]] = lines[1].split(':', 1)[1].strip()
# Line 2 is the "Books:" header; every later line is a book title
d[lines[2].split(':')[0]] = [[line.rstrip('\n') for line in lines[3:]]]
df = pd.DataFrame(d)
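Both versions above store the book titles as a list inside one cell. If the comma-separated string shown in the question is wanted instead, a minimal sketch could look like the following. The helper name parse_profile and the inline sample string are illustrative assumptions, and the sketch assumes that the first "Key:" line with an empty value starts the list and that everything after it is a list item.

```python
import pandas as pd


def parse_profile(lines):
    """Hypothetical helper: turn the lines of one profile into a dict."""
    record = {}
    list_key = None  # set once a "Key:" line with no value is seen
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if list_key is not None:
            # Past the list header: keep the line whole, colons included
            record[list_key].append(line)
            continue
        key, _, value = line.partition(':')
        if value.strip():
            record[key.strip()] = value.strip()
        else:
            list_key = key.strip()
            record[list_key] = []
    if list_key is not None:
        # Collapse the collected titles into one comma-separated string
        record[list_key] = ', '.join(record[list_key])
    return record


# Sample data copied from the question, standing in for test.txt
sample = """Name: Tom
Gender: Male
Books:
The problem of Pain
The reason for God: belief in an age of skepticism
"""

df = pd.DataFrame([parse_profile(sample.splitlines())])
print(df)
```

To read from a file instead, the lines of an open file object can be passed to parse_profile in place of sample.splitlines().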
Answer 1 (score: 0)
The approach I used is simple: regex.
import os, re
import pandas as pd

# PROFILES is not defined in this snippet; it is expected to hold the
# directory path of the profile files, ending with a path separator.
# List all the files in the directory that end with .txt
files = [file for file in os.listdir(PROFILES) if file.endswith(".txt")]
HEADERS = ['Name', 'Gender', 'Books']
DATA = []  # create the empty list to store profiles

for file in files:  # iterate over each file
    filename = PROFILES + file  # full path name of the data file
    text_file = open(filename, "r")  # open the file
    lines = text_file.read()  # read the whole file into memory
    text_file.close()  # close the file
    ###############################################################
    # Regex to capture each column header and its row data.    ####
    # Odd group number == header, even group number == data    ####
    ###############################################################
    books = re.search(r"(Name):(.*)\n+(Gender):(.*)\n+(Books):((?<=Books:)\D+)", lines)
    # append the even-numbered groups (the data) to the DATA list
    DATA.append([books.group(i).strip() for i in range(len(books.groups()) + 1) if not i % 2 and i != 0])
profilesDF = pd.DataFrame(DATA, columns=HEADERS)  # create the dataframe
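To see what the pattern captures, it can be run against the sample profile from the question. This is only a sketch: the sample string stands in for the content of one file, re.search is used as the matching call, and the variable names are illustrative. Note that \D+ stops at the first digit, so a book title containing a number would be cut off.

```python
import re

# Sample profile from the question, standing in for one file's content
sample = (
    "Name: Tom\n"
    "Gender: Male\n"
    "Books:\n"
    "The problem of Pain\n"
    "The reason for God: belief in an age of skepticism\n"
)

pattern = r"(Name):(.*)\n+(Gender):(.*)\n+(Books):((?<=Books:)\D+)"
match = re.search(pattern, sample)

# Odd-numbered groups hold the headers, even-numbered groups hold the data
headers = [match.group(i) for i in (1, 3, 5)]
values = [match.group(i).strip() for i in (2, 4, 6)]

# The Books group spans several lines; join them into one string
books = ", ".join(values[2].splitlines())

print(headers)
print(values[0], values[1], books, sep=" | ")
```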