The input is a string containing two sentences:
s = 'Sentence 1 here. This sentence contains 1 fl. oz. but is one sentence.'
I want to .split s into sentences according to the following logic:
Desired result:
['Sentence 1 here.', 'This sentence contains 1 fl. oz. but is one sentence.']
This would also be fine:
['Sentence 1 here', 'This sentence contains 1 fl. oz. but is one sentence.']
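But at the moment I chop off the 0th character of every sentence after the first, because the uppercase character is consumed by the pattern: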
import re
END_SENT = re.compile(r'[.!?(.")]+[ ]+[A-Z]')
print(END_SENT.split(s))
['Sentence 1 here', 'his sentence contains 1 fl. oz. but is one sentence.']
Note the missing T. How can I tell .split to ignore certain elements of the compiled pattern?
Answer 0 (score: 2)
It is easier to describe a sentence than to try to identify the delimiters. So instead of re.split, try re.findall:
re.findall(r'([^.?!\s].*?[.?!]*)\s*(?![^A-Z])', s)
To keep the next uppercase letter, the pattern uses a lookahead, which is only a test and does not consume any characters.
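As a minimal sketch of the difference, the snippet below runs the pattern from the question next to the findall pattern above on the same input. The variable names (consuming, lost, sentences) are only illustrative and are not part of the original posts.

```python
import re

s = 'Sentence 1 here. This sentence contains 1 fl. oz. but is one sentence.'

# Pattern from the question: the trailing [A-Z] is part of the match,
# so re.split throws the uppercase letter away with the delimiter.
consuming = re.compile(r'[.!?(.")]+[ ]+[A-Z]')
lost = consuming.split(s)

# Pattern from this answer: the lookahead (?![^A-Z]) only tests the next
# character, so the uppercase letter stays available for the next match.
sentences = re.findall(r'([^.?!\s].*?[.?!]*)\s*(?![^A-Z])', s)

print(lost)
# ['Sentence 1 here', 'his sentence contains 1 fl. oz. but is one sentence.']
print(sentences)
# ['Sentence 1 here.', 'This sentence contains 1 fl. oz. but is one sentence.']
```

Because the lookahead is zero-width, each match ends just before the capital letter that starts the next sentence.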
Details:
( # capture group: re.findall returns only the capture group content if there is one
[^.?!\s] # the first character isn't whitespace or a punctuation character
.*? # any characters, matched non-greedily
[.?!]* # optional trailing punctuation characters
)
\s* # zero or more whitespace characters
(?![^A-Z]) # not followed by a character that isn't an uppercase letter
# (so it succeeds before an uppercase letter and at the end of the string)
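The commented breakdown above can be kept inside the code itself by compiling the pattern with re.VERBOSE, which ignores whitespace and # comments in the pattern. This is just a sketch; the name SENTENCE is an assumption made for illustration.

```python
import re

# Same pattern as above, written with re.VERBOSE so the comments
# live next to the pieces they describe. SENTENCE is an illustrative name.
SENTENCE = re.compile(r'''
    (               # capture group: findall returns only this part
        [^.?!\s]    # first character: not whitespace, not . ? !
        .*?         # as few characters as possible
        [.?!]*      # optional trailing punctuation
    )
    \s*             # whitespace between sentences (not captured)
    (?![^A-Z])      # next char is an uppercase letter, or end of string
''', re.VERBOSE)

s = 'Sentence 1 here. This sentence contains 1 fl. oz. but is one sentence.'
sentences = SENTENCE.findall(s)
print(sentences)
# ['Sentence 1 here.', 'This sentence contains 1 fl. oz. but is one sentence.']
```

Note that "fl." and "oz." do not end a sentence here only because the next word starts with a lowercase letter; an abbreviation followed by a capitalized word would still be split.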
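Obviously, for more complex cases involving abbreviations, names, and so on, you will have to use a tool like nltk or any other NLP tool trained with a dictionary.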
Answer 1 (score: 2)
((?<=[.!?])|(?<=\.\")) +(?=[A-Z])
Try it here.
However, I would suggest the following instead, which also treats a quote following any of . ! ? as a split condition:
((?<=[.!?])|(?<=[.!?]\")) +(?=[A-Z])
Try it here.
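One caveat when using these patterns with Python's re.split: if the pattern contains a capturing group, the text matched by the group is also returned in the result list. The outer group here matches an empty string, so an empty element appears between the sentences. A sketch of one way around this, assuming a non-capturing group (?:...) is acceptable; the names SPLIT_CAPTURING and SPLIT are illustrative only.

```python
import re

s = 'Sentence 1 here. This sentence contains 1 fl. oz. but is one sentence.'

# The second pattern exactly as written above, with a capturing group.
SPLIT_CAPTURING = re.compile(r'((?<=[.!?])|(?<=[.!?]\")) +(?=[A-Z])')
# Same logic with a non-capturing group (?:...), which re.split ignores.
SPLIT = re.compile(r'(?:(?<=[.!?])|(?<=[.!?]")) +(?=[A-Z])')

with_group = SPLIT_CAPTURING.split(s)
parts = SPLIT.split(s)

print(with_group)
# ['Sentence 1 here.', '', 'This sentence contains 1 fl. oz. but is one sentence.']
print(parts)
# ['Sentence 1 here.', 'This sentence contains 1 fl. oz. but is one sentence.']
```

Only the spaces are consumed by the split, so both the closing punctuation and the following capital letter are preserved.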
Explanation
The splitting part: ' +(?=[A-Z])'
' +' #One or more spaces (the only characters actually consumed by the split)
(?= #START positive lookahead: check that this follows, but do not consume it
[A-Z] #Any uppercase letter
) #END positive lookahead
Conditions on what precedes the space
For Solution 1:
( #GROUP START
(?<= #START positive lookbehind: make sure this comes before, but do not consume it
[.!?] #any one of these chars must come right before the splitting space
) #END positive lookbehind
| #OR; the alternation is the reason all of this had to be wrapped in a group
(?<= #START positive lookbehind
\.\" #the splitting space may also be preceded by .", a case the previous set of . ! ? does not cover
) #END positive lookbehind
) #END GROUP
For Solution 2:
( #GROUP START
(?<=[.!?]) #Same as the previous look behind
| #OR condition
(?<=[.!?]\") #Only difference here is that we are allowing quote after any of . or ! or ?
) #GROUP END
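To see where the two solutions differ, the sketch below applies both to a made-up sentence with quoted speech. The sample text and the names SOLUTION1 and SOLUTION2 are assumptions for illustration, and the groups are written as non-capturing so that re.split returns only the sentences.

```python
import re

# Made-up sample text, not from the original question.
text = 'She asked "Why?" He shrugged. "Fine." They left.'

# Solution 1: split after . ! ? or after ." only.
SOLUTION1 = re.compile(r'(?:(?<=[.!?])|(?<=\.")) +(?=[A-Z])')
# Solution 2: split after . ! ? or after any of those followed by a quote.
SOLUTION2 = re.compile(r'(?:(?<=[.!?])|(?<=[.!?]")) +(?=[A-Z])')

first = SOLUTION1.split(text)
second = SOLUTION2.split(text)

print(first)
# ['She asked "Why?" He shrugged. "Fine."', 'They left.']
print(second)
# ['She asked "Why?"', 'He shrugged. "Fine."', 'They left.']
```

Solution 1 misses the boundary after ?" because only ." is covered. Neither pattern splits before "Fine." since the lookahead requires an uppercase letter, not an opening quote, immediately after the spaces.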