I am writing a script in Python that contains the following string:
a = "write This is mango. write This is orange."
I want to split this string into sentences and add each sentence as a list item, so that it becomes:
list = ['write This is mango.', 'write This is orange.']
I tried using TextBlob, but it did not parse the string correctly (it read the whole string as a single sentence).
Is there a simple way to do this?
Answer 0 (score: 1)
One way is re.split with a positive lookbehind assertion:
>>> import re
>>> a = "write This is mango. write This is orange."
>>> re.split(r'(?<=\w\.)\s', a)
['write This is mango.', 'write This is orange.']
If you want to split on multiple delimiters, say . and , then use a character set in the assertion:
>>> a = "write This is mango. write This is orange. This is guava, and not pear."
>>> re.split(r'(?<=\w[,\.])\s', a)
['write This is mango.', 'write This is orange.', 'This is guava,', 'and not pear.']
Also, you should not use list as a variable name, because that will shadow the built-in list.
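As a minimal sketch, the lookbehind split above can be wrapped in a helper that also tolerates runs of whitespace and other sentence-ending punctuation. The function name split_sentences and the default terminators ".", "!" and "?" are assumptions made for illustration and are not part of the original answer.

```python
import re

def split_sentences(text, terminators=".!?"):
    """Split text at whitespace that follows a word character plus one terminator."""
    # The lookbehind keeps the punctuation attached to the preceding sentence;
    # \s+ swallows any run of spaces, tabs, or newlines between sentences.
    pattern = r"(?<=\w[" + re.escape(terminators) + r"])\s+"
    return re.split(pattern, text.strip())

sentences = split_sentences("write This is mango.  write This is orange.")
print(sentences)
```

Binding the result to a name such as sentences also avoids shadowing the built-in list.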
Answer 1 (score: 1)
You should take a look at NLTK for Python. Here is a sample from NLTK.org:
>>> import nltk
>>> sentence = """At eight o'clock on Thursday morning
... Arthur didn't feel very good."""
>>> tokens = nltk.word_tokenize(sentence)
>>> tokens
['At', 'eight', "o'clock", 'on', 'Thursday', 'morning',
'Arthur', 'did', "n't", 'feel', 'very', 'good', '.']
>>> tagged = nltk.pos_tag(tokens)
>>> tagged[0:6]
[('At', 'IN'), ('eight', 'CD'), ("o'clock", 'JJ'), ('on', 'IN'),
('Thursday', 'NNP'), ('morning', 'NN')]
For your case, you could start with:
import nltk
a = "write This is mango. write This is orange."
# Note: word_tokenize returns word-level tokens, not whole sentences.
tokens = nltk.word_tokenize(a)
Answer 2 (score: 0)
Do you know about string.split? It can take a multi-character separator:
>>> "wer. wef. rgo.".split(". ")
['wer', 'wef', 'rgo.']
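But it is not very flexible about the amount of whitespace. If you cannot control how many spaces follow the full stop, I recommend regular expressions ("import re"). For that matter, you could simply split on "." and then clean up the whitespace at the front of each sentence and the empty string left after the last ".".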
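A rough sketch of that last suggestion, assuming every sentence ends with a plain "." (the helper name split_on_period is hypothetical): split on the period, strip the surrounding whitespace, drop empty pieces, and put the period back.

```python
def split_on_period(text):
    """Split on '.', trim whitespace, and discard empty fragments."""
    pieces = (piece.strip() for piece in text.split("."))
    # split() removes the separator, so re-append it; the filter skips the
    # empty piece that follows the final period.
    return [piece + "." for piece in pieces if piece]

result = split_on_period("wer.   wef. rgo.")
print(result)
```

Because the whitespace is stripped after splitting, the number of spaces after each full stop no longer matters.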
Answer 3 (score: 0)
This should work. Take a look at the .split() function here: http://www.tutorialspoint.com/python/string_split.htm
a = "write This is mango. write This is orange."
print(a.split('.', 1))  # maxsplit=1: splits only at the first '.'
Answer 4 (score: 0)
a.split() may look like a simple approach, but you will eventually run into problems.
For example, suppose you have
a = 'What is the price of the orange? \
It costs $1.39. \
Thank you! \
See you soon Mr. Meowgi.'
a.split('.') would return the following pieces (leading whitespace and the trailing empty string omitted):
a[0] = 'What is the price of the orange? It costs $1'
a[1] = '39'
a[2] = 'Thank you! See you soon Mr'
a[3] = 'Meowgi'
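I have not considered other cases here either.
This ultimately comes down to English grammar. I suggest looking into the nltk module, as Mike Tung pointed out.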
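To make the failure mode concrete, here is a hedged sketch of a regex heuristic that happens to handle this particular example. The abbreviation check covers only "Mr." and is an assumption for illustration; real text needs a much longer list of exceptions or a trained tokenizer such as the one in nltk.

```python
import re

text = ("What is the price of the orange? It costs $1.39. "
        "Thank you! See you soon Mr. Meowgi.")

# Split on whitespace that follows '.', '!' or '?', unless the three
# characters before the whitespace are the abbreviation 'Mr.'.
# '$1.39' stays intact because no whitespace follows its decimal point.
pattern = re.compile(r"(?<!Mr\.)(?<=[.!?])\s+")
sentences = pattern.split(text)

for sentence in sentences:
    print(sentence)
```

Each new abbreviation ("Dr.", "e.g.", and so on) would need its own fixed-width negative lookbehind, which is why a heuristic like this does not scale far.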