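I have a string taken from an article with a few hundred sentences. I want to convert the string into a DataFrame, with each sentence as a row. For example,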
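data = 'This is a book, to which I found exciting. I bought it for my cousin. He likes it.'
I would like it to become:
This is a book, to which I found exciting.
I bought it for my cousin.
He likes it.
As a Python newbie, I tried reading the string with read_csv. With that code, all the sentences become column names. What I actually want is for them to be rows in a single column.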
Answer 0 (score: 5)
Do not use read_csv. Just split on '.' and use the standard pd.DataFrame:
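import pandas as pd
data = 'This is a book, to which I found exciting. I bought it for my cousin. He likes it.'
# Split on '.', trim the space left after each period, and skip the empty piece after the final '.'
data_df = pd.DataFrame([sentence.strip() for sentence in data.split('.') if sentence.strip()],
                       columns=['sentences'])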
print(data_df)
# sentences
# 0 This is a book, to which I found exciting
# 1 I bought it for my cousin
# 2 He likes it
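Keep in mind that this will break if any sentence contains a floating-point number, because the decimal point is also a '.'. In that case you will need to change the format of the string (for example, use '\n' instead of '.' to separate the sentences).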
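To make that caveat concrete, here is a minimal sketch. The sample strings `with_float` and `by_lines` are made up for illustration and do not come from the question.

```python
import pandas as pd

# Hypothetical sample text containing a decimal number (not from the question).
with_float = 'The book costs 3.5 dollars. I bought it for my cousin.'

# Splitting on '.' cuts the number "3.5" in two.
naive = [s.strip() for s in with_float.split('.') if s.strip()]
# naive == ['The book costs 3', '5 dollars', 'I bought it for my cousin']

# If the text can be stored with one sentence per line instead,
# str.splitlines() leaves the decimal point alone.
by_lines = 'The book costs 3.5 dollars.\nI bought it for my cousin.'
lines_df = pd.DataFrame(by_lines.splitlines(), columns=['sentences'])
print(lines_df)
```

This only helps when you control how the text is produced. For an article that already exists as one long string, the sentence boundaries still have to be detected somehow.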
Answer 1 (score: 1)
This is a quick-and-dirty solution, but it solves your problem:
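from io import StringIO  # read_csv expects a path or file-like object, so wrap the raw string
data_df = pd.read_csv(StringIO(data), sep=".", header=None).T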
Answer 2 (score: 1)
You can achieve this with a list comprehension:
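import pandas as pd
data = 'This is a book, to which I found exciting. I bought it for my cousin. He likes it.'
# Split on '. ' and restore the period the split removed; the last piece already ends with one.
df = pd.DataFrame({'sentence': [s if s.endswith('.') else s + '.' for s in data.split('. ')]})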
print(df)
# sentence
# 0 This is a book, to which I found exciting.
# 1 I bought it for my cousin.
# 2 He likes it.
Answer 3 (score: 0)
What you are trying to do is called sentence tokenization. The easiest way is to use a text-mining library such as NLTK:
# sent_tokenize needs NLTK's Punkt tokenizer data, downloaded once via nltk.download
from nltk.tokenize import sent_tokenize
pd.DataFrame(sent_tokenize(data))
Otherwise you could try something like:
pd.DataFrame(data.split('. '))
However, this will fail if you run into a sentence like this:
problem = 'Tim likes to jump... but not always!'
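A quick check of that failure case, followed by one possible workaround: splitting with a regular expression on whitespace that follows sentence-ending punctuation and precedes an uppercase letter. The pattern and the longer `text` string are assumptions chosen for this sketch, not something taken from the answers. The pattern will still mis-split abbreviations such as 'Dr. Smith'.

```python
import re
import pandas as pd

problem = 'Tim likes to jump... but not always!'

# The plain split treats the last dot of the ellipsis as a sentence boundary.
naive = problem.split('. ')
# naive == ['Tim likes to jump..', 'but not always!']

# Illustrative pattern: split on whitespace that is preceded by '.', '!' or '?'
# and followed by an uppercase letter.
boundary = re.compile(r'(?<=[.!?])\s+(?=[A-Z])')

# Hypothetical longer text built around the problem sentence.
text = 'Tim likes to jump... but not always! He rests on Sundays.'
df = pd.DataFrame(boundary.split(text), columns=['sentence'])
print(df)
```

Because "but" starts with a lowercase letter, the ellipsis is not treated as a boundary, while the space before "He" is. For real articles, a trained tokenizer such as NLTK's is likely to be more robust than any single pattern.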