Splitting a string into sentences using Python

Time: 2019-04-11 20:21:56

Tags: python string

I have the following string:

string = 'This is one sentence  ${w_{1},..,w_{i}}$. This is another sentence. '

Now I want to split it into two sentences.

However, when I do this:

string.split('.')

I get:

['This is one sentence  ${w_{1},',
 '',
 ',w_{i}}$',
 ' This is another sentence',
 ' ']

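Does anyone have an idea how to improve this so that the "." inside $ $ is not picked up as a sentence boundary?

Also, how would you handle: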

string2 = 'This is one sentence  ${w_{1},..,w_{i}}$! This is another sentence. Is this a sentence? Maybe !  '

Edit 1:

The desired output would be:

For string 1:

['This is one sentence  ${w_{1},..,w_{i}}$','This is another sentence']

For string 2:

['This is one sentence  ${w_{1},..,w_{i}}$','This is another sentence', 'Is this a sentence', 'Maybe !  ']

3 Answers:

Answer 0: (score: 3)

For a more general case, you can use re.split like this:

import re

mystr = 'This is one sentence  ${w_{1},..,w_{i}}$. This is another sentence. '

re.split(r"[.!?]\s{1,}", mystr)
# ['This is one sentence  ${w_{1},..,w_{i}}$', 'This is another sentence', '']

str2 = 'This is one sentence  ${w_{1},..,w_{i}}$! This is another sentence. Is this a sentence? Maybe !  '

re.split(r"[.!?]\s{1,}", str2)
# ['This is one sentence  ${w_{1},..,w_{i}}$', 'This is another sentence', 'Is this a sentence', 'Maybe ', '']

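The characters in the brackets are the punctuation marks you want to split on. The \s{1,} after them requires at least one whitespace character, so any other . that has no whitespace after it (such as the ones inside $ $) is ignored. This also handles your exclamation mark case.

Here is a way to get the punctuation back: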
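As a minimal sketch building on the re.split call above, the trailing empty string can be filtered out, and a terminator at the very end of the text (with no whitespace after it) can be consumed as well. The helper name split_sentences and the added `|$` alternative are assumptions of mine, not part of the original answer:

```python
import re

def split_sentences(text):
    """Split on '.', '!' or '?' followed by whitespace or end of text.

    Hypothetical helper: the `|$` alternative is an addition to the
    answer's pattern, so a final terminator with nothing after it is
    also consumed.
    """
    parts = re.split(r"[.!?](?:\s+|$)", text)
    # Drop the empty string left over after the last terminator.
    return [part for part in parts if part]

str2 = 'This is one sentence  ${w_{1},..,w_{i}}$! This is another sentence. Is this a sentence? Maybe !  '
print(split_sentences(str2))
# ['This is one sentence  ${w_{1},..,w_{i}}$', 'This is another sentence', 'Is this a sentence', 'Maybe ']
```

Note that 'Maybe ' keeps the space that preceded the '!'; call .strip() on each piece if that matters.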

punct = re.findall(r"[.!?]\s{1,}", str2)
# ['! ', '. ', '? ', '!  ']

# zip stops at the shorter list, so the trailing '' from re.split is dropped
sent = [x+y for x,y in zip(re.split(r"[.!?]\s{1,}", str2), punct)]
sent
# ['This is one sentence  ${w_{1},..,w_{i}}$! ', 'This is another sentence. ', 'Is this a sentence? ', 'Maybe !  ']
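If you only need each sentence with its terminator attached, a possible alternative to the findall/zip pairing is to split on the whitespace itself and use a lookbehind to require a terminator just before it. This is a sketch under that assumption; the helper name split_keep_punct is made up for illustration:

```python
import re

def split_keep_punct(text):
    """Split on whitespace that directly follows '.', '!' or '?'.

    Hypothetical helper: the lookbehind checks for the terminator
    without consuming it, so it stays attached to its sentence.
    """
    pieces = re.split(r"(?<=[.!?])\s+", text)
    # Trailing whitespace after the last terminator yields an empty piece.
    return [piece for piece in pieces if piece]

str2 = 'This is one sentence  ${w_{1},..,w_{i}}$! This is another sentence. Is this a sentence? Maybe !  '
print(split_keep_punct(str2))
# ['This is one sentence  ${w_{1},..,w_{i}}$!', 'This is another sentence.', 'Is this a sentence?', 'Maybe !']
```

Unlike the zip version, the whitespace after each terminator is discarded rather than kept.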

Answer 1: (score: 3)

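You can use re.findall with an alternation pattern. To make sure each sentence starts and ends with a non-whitespace character, use a positive lookahead at the start and a positive lookbehind at the end. The \$.*?\$ alternative consumes a whole $...$ span, so punctuation inside it is not treated as the end of a sentence:

re.findall(r'((?=[^.!?\s])(?:\$.*?\$|[^.!?])*(?<=[^.!?\s]))\s*[.!?]', string)

For the first string, this returns:

['This is one sentence  ${w_{1},..,w_{i}}$', 'This is another sentence']

and for the second string:

['This is one sentence  ${w_{1},..,w_{i}}$', 'This is another sentence', 'Is this a sentence', 'Maybe']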
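For readability, here is a sketch of the same findall pattern written with re.VERBOSE, with both dollar signs of the math span escaped as \$. The name SENTENCE is an assumption chosen for illustration:

```python
import re

# Hypothetical compiled form of the findall pattern, one piece per line.
SENTENCE = re.compile(r"""
    (                       # group 1: the sentence text that is returned
        (?=[^.!?\s])        # must start on a non-space, non-terminator char
        (?:
            \$.*?\$         # a whole $...$ math span, punctuation included
          | [^.!?]          # or any single non-terminator character
        )*
        (?<=[^.!?\s])       # must end on a non-space, non-terminator char
    )
    \s*[.!?]                # optional spaces, then the terminator itself
""", re.VERBOSE)

string = 'This is one sentence  ${w_{1},..,w_{i}}$. This is another sentence. '
string2 = 'This is one sentence  ${w_{1},..,w_{i}}$! This is another sentence. Is this a sentence? Maybe !  '

print(SENTENCE.findall(string))
# ['This is one sentence  ${w_{1},..,w_{i}}$', 'This is another sentence']
print(SENTENCE.findall(string2))
# ['This is one sentence  ${w_{1},..,w_{i}}$', 'This is another sentence', 'Is this a sentence', 'Maybe']
```

The math alternative is listed first so that it is tried before the single-character one; otherwise the opening $ would be consumed as an ordinary character and the periods inside the span would end the sentence.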

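Answer 2: (score: 0)

Use '. ' (a period followed by a space) as the separator, since that sequence only occurs at the end of a sentence and not in the middle of one.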

string = 'This is one sentence  ${w_{1},..,w_{i}}$. This is another sentence. '

string.split('. ')

This returns:

['This is one sentence  ${w_{1},..,w_{i}}$', 'This is another sentence', '']