How do I read a text file with an uneven number of columns using python-pandas?

Time: 2016-06-06 05:37:27

Tags: python pandas dataframe

Given a file with the following format:

really:1 christensen:1 scariest:1 many_of:1 label:positive
varied_experiences:1 experiences_from:1 island_resident:1 many_and:1 label:positive
scariest:1 many_of:1 label:negative

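The last column is the polarity label, whose value is either negative or positive. The columns before it are the bag-of-words representation of the corresponding paragraph. How can I read the file into a DataFrame with two columns, where the first column is the bag-of-words string and the second is the label? Thanks in advance!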

1 Answer:

Answer 0: (score: 1)

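You only need read_csv, using " label:" as the separator. A separator longer than one character is interpreted as a regular expression, which is why engine='python' is passed: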

import pandas as pd
import io

temp=u"""really:1 christensen:1 scariest:1 many_of:1 label:positive
varied_experiences:1 experiences_from:1 island_resident:1 many_and:1 label:positive
scariest:1 many_of:1 label:negative"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), 
                 sep=r" label:",
                 header=None, 
                 names=['bag','label'], 
                 engine='python')
print (df)
                                                 bag     label
0       really:1 christensen:1 scariest:1 many_of:1   positive
1  varied_experiences:1 experiences_from:1 island...  positive
2                              scariest:1 many_of:1   negative
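As a minimal sketch of the "replace io.StringIO(temp) with the filename" step, the same call can be pointed at a file on disk. The temporary file created here is only a stand-in for the real data file, and its path is an assumption for illustration:

```python
import os
import tempfile

import pandas as pd

sample = """really:1 christensen:1 scariest:1 many_of:1 label:positive
varied_experiences:1 experiences_from:1 island_resident:1 many_and:1 label:positive
scariest:1 many_of:1 label:negative"""

# Hypothetical stand-in for the real file: write the sample to a temp file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write(sample)
    path = f.name

# Same arguments as above, but reading from a path instead of StringIO.
df = pd.read_csv(path,
                 sep=r" label:",
                 header=None,
                 names=['bag', 'label'],
                 engine='python')

os.remove(path)  # clean up the stand-in file
print(df)
```

This relies on the text " label:" appearing exactly once per line, directly before the label value.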

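A more general solution is to read each line as a single string and rsplit it on the last space (the label column then keeps its "label:" prefix):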

import pandas as pd
import io

temp=u"""really:1 christensen:1 scariest:1 many_of:1 label:positive
varied_experiences:1 experiences_from:1 island_resident:1 many_and:1 label:positive
scariest:1 many_of:1 label:negative"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp),
                 sep=";", # some string which does NOT occur anywhere in the text
                 header=None, 
                 names=['text'])
print (df)
                                                text
0  really:1 christensen:1 scariest:1 many_of:1 la...
1  varied_experiences:1 experiences_from:1 island...
2                scariest:1 many_of:1 label:negative

df[['bag','label']] = df.text.str.rsplit(expand=True, n=1)
df = df.drop('text', axis=1)
print (df)
                                                 bag           label
0        really:1 christensen:1 scariest:1 many_of:1  label:positive
1  varied_experiences:1 experiences_from:1 island...  label:positive
2                               scariest:1 many_of:1  label:negative
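If only the bare label value is wanted, the "label:" prefix left by the rsplit approach can be removed afterwards. The following is a sketch rather than part of the original answer; it assumes every label field starts with the literal text "label:" and that ";" never occurs in the data:

```python
import io

import pandas as pd

temp = u"""really:1 christensen:1 scariest:1 many_of:1 label:positive
varied_experiences:1 experiences_from:1 island_resident:1 many_and:1 label:positive
scariest:1 many_of:1 label:negative"""

# Read each line as one string; ";" is assumed not to occur in the data.
df = pd.read_csv(io.StringIO(temp), sep=";", header=None, names=['text'])

# Split once from the right, at the last whitespace.
df[['bag', 'label']] = df['text'].str.rsplit(n=1, expand=True)

# Remove the literal "label:" prefix so only positive/negative remains.
df['label'] = df['label'].str.replace('label:', '', regex=False)

df = df.drop('text', axis=1)
print(df['label'].tolist())
```

Passing regex=False makes the replacement a plain substring match, so the prefix text is not interpreted as a pattern.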