I have the following string:
string = 'This is one sentence ${w_{1},..,w_{i}}$. This is another sentence. '
Now I want to split it into two sentences.
However, when I do this:
string.split('.')
I get:
['This is one sentence ${w_{1},',
'',
',w_{i}}$',
' This is another sentence',
' ']
Does anyone have an idea how to improve this so that the "." inside $ $ is not treated as a sentence boundary?
Also, how would you handle:
string2 = 'This is one sentence ${w_{1},..,w_{i}}$! This is another sentence. Is this a sentence? Maybe ! '
Edit 1:
The desired output would be:
For string 1:
['This is one sentence ${w_{1},..,w_{i}}$','This is another sentence']
For string 2:
['This is one sentence ${w_{1},..,w_{i}}$','This is another sentence', 'Is this a sentence', 'Maybe ! ']
Answer 0 (score: 3)
For a more general case, you can use re.split like this:
import re
mystr = 'This is one sentence ${w_{1},..,w_{i}}$. This is another sentence. '
re.split(r"[.!?]\s{1,}", mystr)
# ['This is one sentence ${w_{1},..,w_{i}}$', 'This is another sentence', '']
str2 = 'This is one sentence ${w_{1},..,w_{i}}$! This is another sentence. Is this a sentence? Maybe ! '
re.split(r"[.!?]\s{1,}", str2)
# ['This is one sentence ${w_{1},..,w_{i}}$', 'This is another sentence', 'Is this a sentence', 'Maybe ', '']
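The characters in the brackets are the punctuation marks you want to split on. The \s{1,} that follows requires at least one whitespace character after the mark, so any . that is not followed by whitespace (such as the ones inside $ $) is ignored. This also handles your exclamation-mark case.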
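Note that re.split leaves an empty string at the end whenever the text finishes with punctuation plus whitespace. A minimal sketch of one way to drop it, assuming empty pieces are never wanted (the helper name `split_sentences` is made up for illustration):

```python
import re

def split_sentences(text):
    # Split after ., ! or ? when followed by at least one whitespace character.
    parts = re.split(r"[.!?]\s+", text)
    # Drop the empty piece left over when the text ends with a delimiter.
    return [p for p in parts if p]

mystr = 'This is one sentence ${w_{1},..,w_{i}}$. This is another sentence. '
print(split_sentences(mystr))
# ['This is one sentence ${w_{1},..,w_{i}}$', 'This is another sentence']
```

Here \s+ is just the shorter spelling of \s{1,}.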
Here is a way to put the punctuation back:
punct = re.findall(r"[.!?]\s{1,}", str2)
# ['! ', '. ', '? ', '! ']
sent = [x+y for x,y in zip(re.split(r"[.!?]\s{1,}", str2), punct)]
sent
# ['This is one sentence ${w_{1},..,w_{i}}$! ', 'This is another sentence. ', 'Is this a sentence? ', 'Maybe ! ']
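If you would rather not zip two lists together, a possible alternative (not part of the approach above, just a sketch) is to split on the whitespace itself and use a lookbehind so the punctuation stays attached to its sentence. The trailing whitespace is consumed by the split, so the pieces come back without it:

```python
import re

str2 = 'This is one sentence ${w_{1},..,w_{i}}$! This is another sentence. Is this a sentence? Maybe ! '
# Split on whitespace that directly follows ., ! or ?.
# The lookbehind does not consume the mark, so it stays in the sentence.
sentences = [s for s in re.split(r"(?<=[.!?])\s+", str2) if s]
print(sentences)
# ['This is one sentence ${w_{1},..,w_{i}}$!', 'This is another sentence.', 'Is this a sentence?', 'Maybe !']
```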
Answer 1 (score: 3)
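You can use re.findall with an alternation pattern. To make sure each sentence starts and ends with a non-whitespace character, use a positive lookahead at the start and a positive lookbehind at the end:
re.findall(r'((?=[^.!?\s])(?:\$.*?\$|[^.!?])*(?<=[^.!?\s]))\s*[.!?]', string)
For the first string this returns:
['This is one sentence ${w_{1},..,w_{i}}$', 'This is another sentence']
and for the second string:
['This is one sentence ${w_{1},..,w_{i}}$', 'This is another sentence', 'Is this a sentence', 'Maybe']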
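To make the alternation easier to read, here is a sketch that spells the pattern out with re.VERBOSE and escapes both dollar signs so that a whole $...$ span is consumed as one unit. The names `SENTENCE` and `find_sentences` are invented for this example:

```python
import re

SENTENCE = re.compile(r"""
    (                        # group 1: the sentence text that findall returns
      (?=[^.!?\s])           # first character: not whitespace, not a terminator
      (?:\$.*?\$|[^.!?])*    # a complete dollar-delimited span, or one non-terminator
      (?<=[^.!?\s])          # last character: not whitespace, not a terminator
    )
    \s*[.!?]                 # optional whitespace, then the terminating mark
""", re.VERBOSE)

def find_sentences(text):
    # With a single capturing group, findall returns the group contents only.
    return SENTENCE.findall(text)

string = 'This is one sentence ${w_{1},..,w_{i}}$. This is another sentence. '
string2 = 'This is one sentence ${w_{1},..,w_{i}}$! This is another sentence. Is this a sentence? Maybe ! '
print(find_sentences(string))
# ['This is one sentence ${w_{1},..,w_{i}}$', 'This is another sentence']
print(find_sentences(string2))
# ['This is one sentence ${w_{1},..,w_{i}}$', 'This is another sentence', 'Is this a sentence', 'Maybe']
```

Because the \$.*?\$ branch is tried first, the periods inside the math span never get a chance to end a sentence. The lookbehind is what trims the space before the final "!" in "Maybe !".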
Answer 2 (score: 0)
Split on '. ' (a period followed by a space), since that only occurs at the end of a sentence and not in the middle of one.
string = 'This is one sentence ${w_{1},..,w_{i}}$. This is another sentence. '
string.split('. ')
This returns:
['This is one sentence ${w_{1},..,w_{i}}$', 'This is another sentence', '']
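One caveat, shown as a quick sketch on the second string from the question: a plain '. ' split only knows about periods, so sentences ending in "!" or "?" stay glued to their neighbours.

```python
string2 = 'This is one sentence ${w_{1},..,w_{i}}$! This is another sentence. Is this a sentence? Maybe ! '
# Only one '. ' occurs in this string, so the split produces just two pieces.
parts = string2.split('. ')
print(parts)
# ['This is one sentence ${w_{1},..,w_{i}}$! This is another sentence', 'Is this a sentence? Maybe ! ']
```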