Given a file with the following format:
really:1 christensen:1 scariest:1 many_of:1 label:positive
varied_experiences:1 experiences_from:1 island_resident:1 many_and:1 label:positive
scariest:1 many_of:1 label:negative
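The last column is the polarity label, whose value is either negative or positive. The preceding columns are the bag-of-words representation of the corresponding passage. How can I read the file into a data frame with two columns, where the first column is the bag-of-words string and the second is the label? Thanks in advance!
Answer 0 (score: 1)
You only need read_csv, using " label:" as the separator (a multi-character separator is treated as a regular expression, which requires engine='python'):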
import pandas as pd
import io
temp=u"""really:1 christensen:1 scariest:1 many_of:1 label:positive
varied_experiences:1 experiences_from:1 island_resident:1 many_and:1 label:positive
scariest:1 many_of:1 label:negative"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp),
                 sep=r" label:",
                 header=None,
                 names=['bag','label'],
                 engine='python')
print (df)
bag label
0 really:1 christensen:1 scariest:1 many_of:1 positive
1 varied_experiences:1 experiences_from:1 island... positive
2 scariest:1 many_of:1 negative
A more general solution is to read each line as a single column and then rsplit on the last whitespace:
import pandas as pd
import io
temp=u"""really:1 christensen:1 scariest:1 many_of:1 label:positive
varied_experiences:1 experiences_from:1 island_resident:1 many_and:1 label:positive
scariest:1 many_of:1 label:negative"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp),
                 sep=";",  # any separator that does NOT occur anywhere in the text
                 header=None,
                 names=['text'])
print (df)
text
0 really:1 christensen:1 scariest:1 many_of:1 la...
1 varied_experiences:1 experiences_from:1 island...
2 scariest:1 many_of:1 label:negative
df[['bag','label']] = df.text.str.rsplit(expand=True, n=1)
df = df.drop('text', axis=1)
print (df)
bag label
0 really:1 christensen:1 scariest:1 many_of:1 label:positive
1 varied_experiences:1 experiences_from:1 island... label:positive
2 scariest:1 many_of:1 label:negative
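Note that with the rsplit approach the label column still carries the literal label: prefix, unlike the read_csv output of the first solution. A minimal sketch of one way to strip it, assuming every label starts with label: as in the sample data (the str.replace step is an addition for illustration, not part of the original answer):

```python
import io
import pandas as pd

# Same sample rows as in the answer above.
temp = u"""really:1 christensen:1 scariest:1 many_of:1 label:positive
varied_experiences:1 experiences_from:1 island_resident:1 many_and:1 label:positive
scariest:1 many_of:1 label:negative"""

# Read each line as one column; ";" is assumed not to occur in the data.
df = pd.read_csv(io.StringIO(temp), sep=";", header=None, names=['text'])

# Split once from the right, on the last whitespace.
df[['bag', 'label']] = df['text'].str.rsplit(n=1, expand=True)
df = df.drop('text', axis=1)

# Remove the literal "label:" prefix (regex=False treats it as plain text).
df['label'] = df['label'].str.replace('label:', '', regex=False)
print(df)
```

After this step the label column holds only positive or negative, matching the result of the first solution.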