Question

The input is a string containing two sentences:

s = 'Sentence 1 here.  This sentence contains 1 fl. oz. but is one sentence.'

I want to .split s into sentences using the following logic:

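a sentence ends with one or more periods, exclamation marks, question marks, or a period followed by a quotation mark,
followed by one or more space characters and an uppercase letter.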

Desired result:

['Sentence 1 here.', 'This sentence contains 1 fl. oz. but is one sentence.']

This would also be fine:

['Sentence 1 here', 'This sentence contains 1 fl. oz. but is one sentence.']

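But at the moment I chop off the first character (index 0) of each following sentence, because the uppercase character is consumed by the match: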

import re
END_SENT = re.compile(r'[.!?(.")]+[ ]+[A-Z]')
print(END_SENT.split(s))
['Sentence 1 here', 'his sentence contains 1 fl. oz. but is one sentence.']

Note the missing T. How can I tell .split to ignore certain elements of the compiled pattern?

Answer 1

It is easier to describe a sentence than to try to identify the delimiters. So instead of re.split, try re.findall:

re.findall(r'([^.?!\s].*?[.?!]*)\s*(?![^A-Z])', s)

To keep the next uppercase letter, the pattern uses a lookahead, which is only a test and does not consume characters.

Details:

(     # capture group: re.findall returns only the capture group content if there is one
    [^.?!\s]   # the first character is not whitespace or a punctuation character
    .*?        # any characters, matched non-greedily
    [.?!]*     # optional trailing punctuation characters
)
\s*            # zero or more whitespace characters
(?![^A-Z])     # not followed by a character that isn't an uppercase letter
               # (so what follows is either an uppercase letter or the end of the string)
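As a minimal sketch, the pattern above can be run against the string from the question. The names SENTENCE and sentences are only illustrative, and the second input string is an invented example.

```python
import re

# Pattern from the answer: capture a sentence, skip trailing whitespace,
# and require that what follows is an uppercase letter or the end of the string.
SENTENCE = re.compile(r'([^.?!\s].*?[.?!]*)\s*(?![^A-Z])')

s = 'Sentence 1 here.  This sentence contains 1 fl. oz. but is one sentence.'
sentences = SENTENCE.findall(s)
print(sentences)
# ['Sentence 1 here.', 'This sentence contains 1 fl. oz. but is one sentence.']

# Caveat: [.?!]* also matches zero punctuation characters, so the pattern
# breaks before any capitalized word, not only after sentence punctuation.
print(SENTENCE.findall('I saw John.'))
# ['I saw', 'John.']
```

The second call shows a limitation: since the punctuation is optional in the pattern, a capitalized name in the middle of a sentence is treated as the start of a new sentence.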

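Obviously, for more complicated cases involving abbreviations, names, and so on, you will have to use a tool such as nltk, or any other NLP tool trained with dictionaries.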

Answer 2

((?<=[.!?])|(?<=\.\")) +(?=[A-Z])


However, I would suggest the following, which also allows a quotation mark after any of . ! ? to count as a split condition:

((?<=[.!?])|(?<=[.!?]\")) +(?=[A-Z])


Explanation

Common to both patterns (note the leading space):

 +(?=[A-Z])

' +'    # one or more spaces (the characters the split actually consumes)
(?=     # START positive lookahead: check what follows, but do not consume it
[A-Z]   # any uppercase ASCII letter
)       # END positive lookahead
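To see the lookahead in isolation, here is a small sketch that splits on the common part alone. The sample string and the names text and parts are invented for illustration.

```python
import re

text = 'It rained. Then Bob left.'

# Split on spaces that are followed by an uppercase letter.
# The lookahead only tests for the letter, so 'T' and 'B' are kept.
parts = re.split(r' +(?=[A-Z])', text)
print(parts)
# ['It rained.', 'Then', 'Bob left.']
```

No letter is lost, but the split also happens before "Bob", which is why a condition on what precedes the space is needed as well.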

The condition before the space
For Solution 1:

(     # GROUP START
(?<=  # START positive lookbehind: make sure this comes before, but do not consume it
[.!?] # one of these characters must come before the splitting space
)     # END positive lookbehind
|     # OR; the alternation is the reason all of this is wrapped in a group
(?<=  # START positive lookbehind
\.\"  # the splitting space may also be preceded by ."; this case is not covered by the previous set of . ! ?
)     # END positive lookbehind
)     # GROUP END

For Solution 2:

(             # GROUP START
(?<=[.!?])    # same as the previous lookbehind
|             # OR
(?<=[.!?]\")  # the only difference: a quotation mark is allowed after any of . ! ?
)             # GROUP END
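A sketch of how the second pattern could be used with Python's re.split. One assumption differs from the pattern as written: the group is made non-capturing with (?:...), because re.split also returns the text of capturing groups, which here would add empty strings to the result. The names SPLIT_AT, result and quoted, and the quoted sample string, are only illustrative.

```python
import re

# Solution 2 with a non-capturing group, so re.split returns only the sentences.
SPLIT_AT = re.compile(r'(?:(?<=[.!?])|(?<=[.!?]")) +(?=[A-Z])')

s = 'Sentence 1 here.  This sentence contains 1 fl. oz. but is one sentence.'
result = SPLIT_AT.split(s)
print(result)
# ['Sentence 1 here.', 'This sentence contains 1 fl. oz. but is one sentence.']

# A closing quotation mark after the punctuation is handled by the second lookbehind.
quoted = 'He said "Stop!" Then he left.'
print(SPLIT_AT.split(quoted))
# ['He said "Stop!"', 'Then he left.']
```

Both lookbehinds have a fixed width (one and two characters), which Python's re module requires.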
