Question

I am trying to parse this text file into a pandas DataFrame. The text file has the following specific format:

Name: Tom 
Gender: Male

Books:
The problem of Pain 
The reason for God: belief in an age of skepticism

So far, my code to import the data is:

import pandas as pd

df = pd.read_table(filename, sep=":|\n", engine='python', index_col=0)
print df

The output I get is:

Name                     Tom   
Gender                   Male
Books                    NaN
The problem of Pain      NaN
The reason for God       belief in an age of skepticism

How should I change the code so that I get the following output instead? (edited output)

Name     Gender    Books
Tom      Male      The problem of Pain, The reason for God: belief in an age of skepticism

Thanks for your help!

Answer 1

There are two ways to do this. The first is to use enumerate() together with an if statement. In the code below I use a text file named test.txt.

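import pandas as pd

d = {}
value_list = []
with open('test.txt', "r") as f:
    # Drop blank lines so the index positions below stay stable
    lines = [line.strip() for line in f if line.strip()]
for index, text in enumerate(lines):
    if index < 2:
        # "Name: Tom" and "Gender: Male": split on the first colon only
        key, val = text.split(':', 1)
        d[key] = val.strip()
    elif index == 2:
        # "Books:" header line
        value = text.split(':')[0]
    else:
        # Remaining lines are book titles (they may contain colons)
        value_list.append(text)
d[value] = [value_list]  # wrap in a list so all titles land in one cell
df = pd.DataFrame(d)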

Alternatively, you can use readlines(), slice the resulting list of lines to fill a dictionary, and then create a DataFrame from it.

import pandas as pd

with open('test.txt', "r") as text_file:
    # Drop blank lines so the positions below line up with the fields
    lines = [line.strip() for line in text_file.readlines() if line.strip()]
d = {}
d[lines[0].split(':')[0]] = lines[0].split(':', 1)[1].strip()
d[lines[1].split(':')[0]] = lines[1].split(':', 1)[1].strip()
d[lines[2].split(':')[0]] = [lines[3:]]  # all remaining lines are book titles
df = pd.DataFrame(d)
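
Both variants above store the book titles as a list inside a single cell. If you want one comma-separated string, as in the output requested in the question, the following is a minimal sketch under the assumption that the file always has the layout shown there. The helper name parse_profile and the in-memory sample are illustrative and not part of the original answer.

```python
import io
import pandas as pd

def parse_profile(lines):
    """Parse one profile (hypothetical helper) into a flat dict."""
    # Drop blank lines and surrounding whitespace
    rows = [ln.strip() for ln in lines if ln.strip()]
    record = {}
    books = []
    in_books = False
    for row in rows:
        if in_books:
            # Everything after the "Books:" header is a title
            books.append(row)
        elif row == "Books:":
            in_books = True
        else:
            # Split on the first colon only, e.g. "Name: Tom"
            key, value = row.split(":", 1)
            record[key.strip()] = value.strip()
    record["Books"] = ", ".join(books)
    return record

# In-memory stand-in for test.txt, copied from the question
sample = io.StringIO(
    "Name: Tom \n"
    "Gender: Male\n"
    "\n"
    "Books:\n"
    "The problem of Pain \n"
    "The reason for God: belief in an age of skepticism\n"
)
df = pd.DataFrame([parse_profile(sample)])
print(df)
```

To read a real file, pass an open file object instead of the StringIO, for example parse_profile(open('test.txt')).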

Answer 2

The approach I used is simple: regex.

import os, re
import pandas as pd


# List all files in the PROFILES directory that end with .txt
# (PROFILES is the path to that directory and must be defined beforehand)
files = [file for file in os.listdir(PROFILES) if file.endswith(".txt")]

HEADERS = ['Name', 'Gender', 'Books']
DATA = []  # empty list to collect the profiles

for file in files:  # iterate over each file
    filename = os.path.join(PROFILES, file)  # full path of the data file

    with open(filename, "r") as text_file:  # the file is closed automatically
        lines = text_file.read()  # read the whole file into memory

    ###############################################################
    # Regex to filter out all the column header and row data. ####
    # Odd Number == Header, Even Number == Data ##################
    ###############################################################

    books = re.search(r"(Name):(.*)\n+(Gender):(.*)\n+(Books):((?<=Books:)\D+)", lines)

    # append data into DATA list
    DATA.append([books.group(i).strip() for i in range(len(books.groups()) + 1) if not i % 2 and i != 0])

profilesDF = pd.DataFrame(DATA, columns=HEADERS) # create the dataframe
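
To see what the pattern captures, here is a small sketch that applies the same regular expression to the sample text from the question rather than to files on disk. The names sample, pattern, headers and values are illustrative, and joining the titles with ", " is my assumption to match the output requested in the question.

```python
import re
import pandas as pd

# Sample profile text, copied from the question
sample = (
    "Name: Tom \n"
    "Gender: Male\n"
    "\n"
    "Books:\n"
    "The problem of Pain \n"
    "The reason for God: belief in an age of skepticism\n"
)

pattern = r"(Name):(.*)\n+(Gender):(.*)\n+(Books):((?<=Books:)\D+)"
match = re.search(pattern, sample)

# Odd-numbered groups are the headers, even-numbered groups are the values
headers = [match.group(i) for i in (1, 3, 5)]
values = [match.group(i).strip() for i in (2, 4, 6)]

# The Books value still contains one title per line; join them with ", "
titles = [t.strip() for t in values[2].splitlines() if t.strip()]
values[2] = ", ".join(titles)

df = pd.DataFrame([values], columns=headers)
print(df)
```

Note that \D+ stops at the first digit, so a book title containing a number would be cut off at that point. A pattern such as [\s\S]+ could be used instead if titles may contain digits.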

Importing data from a text file with multiple conditions using pandas
