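Using pandas to import data from a text file with multiple conditions

Time: 2019-01-22 03:52:04

Tags: python pandas

I am trying to parse this text file into a pandas DataFrame. The text file has the following specific format: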

Name: Tom 
Gender: Male

Books:
The problem of Pain 
The reason for God: belief in an age of skepticism

So far, my code for importing the data is:

import pandas as pd

df = pd.read_table(filename, sep=":|\n", engine='python', index_col=0)
print(df)

The output I get is:

Name                     Tom   
Gender                   Male
Books                    NaN
The problem of Pain      NaN
The reason for God       belief in an age of skepticism

How should I change my code so that the output looks like this? (edited output)

Name     Gender    Books
Tom      Male      The problem of Pain, The reason for God: belief in an age of skepticism

Thanks for your help!

2 answers:

Answer 0: (score: 1)

You can do one of two things. The first is to use enumerate() together with if statements. In the code below I use a text file named test.txt.

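import pandas as pd

d = {}
value_list = []
with open('test.txt', 'r') as f:
    # Drop blank lines so the positional indices below stay valid
    lines = [line.strip() for line in f if line.strip()]
for index, text in enumerate(lines):
    if index < 2:
        # "Name: Tom" / "Gender: Male": split on the first colon only
        key, val = text.split(':', 1)
        d[key] = val.strip()
    elif index == 2:
        # "Books:" header line, used as the column name
        value = text.split(':')[0]
    else:
        # Every remaining line is a book title
        value_list.append(text)
d[value] = [value_list]  # a single cell holding the list of titles
df = pd.DataFrame(d)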
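
The Books cell built this way holds a Python list. If you would rather have the single comma-separated string from the desired output, one option is to join the list after building the frame. This is only a sketch: the dictionary below is a hand-written stand-in for what the loop produces from test.txt, not something read from disk.

```python
import pandas as pd

# Hypothetical stand-in for the dictionary produced by the loop
d = {
    "Name": "Tom",
    "Gender": "Male",
    "Books": [[
        "The problem of Pain",
        "The reason for God: belief in an age of skepticism",
    ]],
}
df = pd.DataFrame(d)

# Collapse the list in each Books cell into one comma-separated string
df["Books"] = df["Books"].apply(", ".join)
print(df.to_string(index=False))
```

Joining after the frame is built keeps the parsing loop unchanged. Joining before construction would also work, but then every value is a scalar and pandas needs an explicit index or a list of records.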

Alternatively, you can use readlines(), slice the resulting list to fill a dictionary, and then create a DataFrame from it.

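import pandas as pd

with open('test.txt', 'r') as text_file:
    # Skip blank lines so the indices below line up with the fields
    lines = [line.strip() for line in text_file.readlines() if line.strip()]
d = {}
# First two lines are "Key: value" pairs; split on the first colon only
d[lines[0].split(':')[0]] = lines[0].split(':', 1)[1].strip()
d[lines[1].split(':')[0]] = lines[1].split(':', 1)[1].strip()
# Third line is the "Books:" header; everything after it is a title
d[lines[2].split(':')[0]] = [lines[3:]]
df = pd.DataFrame(d)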
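
Both variants rely on the fields sitting at fixed line positions. If that is not guaranteed, a parser that keys off the "Books:" header instead may be more forgiving. The following is a minimal sketch, assuming every non-blank line after "Books:" is a title; parse_profile and sample are names made up for this illustration.

```python
def parse_profile(lines):
    """Parse 'Key: value' lines plus a 'Books:' section into a dict.

    Assumes every non-blank line after the 'Books:' header is a book title.
    """
    record = {}
    books = None  # becomes a list once the "Books:" header is seen
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if books is not None:
            books.append(line)  # titles may contain colons, so do not split
        elif line == "Books:":
            books = []
        else:
            key, _, value = line.partition(":")
            record[key.strip()] = value.strip()
    record["Books"] = ", ".join(books or [])
    return record


# Hypothetical sample mirroring the file in the question
sample = [
    "Name: Tom \n",
    "Gender: Male\n",
    "\n",
    "Books:\n",
    "The problem of Pain \n",
    "The reason for God: belief in an age of skepticism\n",
]
print(parse_profile(sample))
```

Passing the result to pandas as a one-element list of records, pd.DataFrame([record]), gives a single row with Name, Gender and Books columns.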

Answer 1: (score: 0)

The approach I used is simple: regex.

import os, re
import pandas as pd


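# List all the files in the directory that end with .txt.
# PROFILES is the path of the directory holding the profile files; it is not defined in this snippet.
files = [file for file in os.listdir(PROFILES) if file.endswith(".txt")]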

HEADERS = ['Name', 'Gender', 'Books']
DATA = []  # create the empty list to store profiles

for file in files:  # iterate over each file
    filename = os.path.join(PROFILES, file)  # full path name of the data file

    text_file = open(filename, "r")  # open the file
    lines = text_file.read()  # read the file in memory
    text_file.close()  # close the file

    ###############################################################
    # Regex to filter out all the column header and row data. ####
    # Odd Number == Header, Even Number == Data ##################
    ###############################################################

    books = re.search(r"(Name):(.*)\n+(Gender):(.*)\n+(Books):((?<=Books:)\D+)", lines)

    # append data into DATA list
    DATA.append([books.group(i).strip() for i in range(len(books.groups()) + 1) if not i % 2 and i != 0])

profilesDF = pd.DataFrame(DATA, columns=HEADERS) # create the dataframe
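
One caveat with the pattern above: \D+ stops at the first digit, so a title containing a number would be cut short, and the Books value keeps its embedded newlines. A possible variation, shown here only as a sketch, uses named groups and splits the captured block into titles. The text variable is a hypothetical in-memory copy of one profile file, and the pattern assumes the three fields always appear in this order.

```python
import re

# Hypothetical sample text mirroring the file format from the question
text = (
    "Name: Tom \n"
    "Gender: Male\n"
    "\n"
    "Books:\n"
    "The problem of Pain \n"
    "The reason for God: belief in an age of skepticism\n"
)

# Name and Gender stay on one line each; Books takes everything that follows
pattern = (
    r"Name:(?P<name>[^\n]*)\n+"
    r"Gender:(?P<gender>[^\n]*)\n+"
    r"Books:(?P<books>[\s\S]*)"
)
match = re.search(pattern, text)

# Split the captured block into titles, dropping blank lines and padding
titles = [t.strip() for t in match.group("books").splitlines() if t.strip()]

row = {
    "Name": match.group("name").strip(),
    "Gender": match.group("gender").strip(),
    "Books": ", ".join(titles),
}
print(row)
```

Each row dictionary can be appended to a list and handed to pd.DataFrame in place of the DATA list of lists, in which case the column names come from the dictionary keys.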