Question

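I am trying to write a regular expression that extracts sentences from text. In my definition, a sentence starts with a capital letter [A-Z] and ends with ".", "!" or "?". But that is not all. Before the start of the sentence there should be a period ".", a question mark "?", an exclamation mark "!", a space, or the start of the string. Likewise, after the end of the sentence there should be a space (or no space) followed by a capital letter, or the end of the string.

These rules are meant to exclude the following false sentences: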

Maria has cat etc. my dog. (one sentence not two!)
https://i-am-cat-and-dog/Explain-what-you-are-doing. (not a sentence)
Cats Dogs Cars (not a sentence)

Answer 1

The exact regex, as you defined it in your post, is this:

(?:[.!? ]|^)([A-Z][^.!?\n]*[.!?])(?= |[A-Z]|$)

Here is the explanation:

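(?:[.!? ]|^) - Per your definition, this ensures the sentence is preceded by ".", "!", "?", a space, or the start of the string.
[A-Z][^.!?\n]*[.!?] - This is your definition of a sentence: it starts with a capital letter, continues with any text other than ".", "!", "?" or a newline, and ends with ".", "!" or "?". The terminator is required, so text that simply runs to the end of the line does not match.
(?= |[A-Z]|$) - Again per your definition, this lookahead requires the sentence to be followed by a space, a capital letter, or the end of the string.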
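
To make the three parts easier to see, here is the same pattern written with Python's re.VERBOSE flag. This is only an annotated sketch of the regex above: the name SENTENCE is my own, and the space in the lookahead is written as [ ] because verbose mode ignores bare whitespace outside a character class.

```python
import re

# Same pattern as above, split across lines with comments (re.VERBOSE).
# SENTENCE is an illustrative name, not something from the question.
SENTENCE = re.compile(
    r"""
    (?:[.!? ]|^)        # preceded by a terminator, a space, or the string start
    (                   # group 1: the sentence itself
        [A-Z]           #   starts with a capital letter
        [^.!?\n]*       #   any run of characters except terminators and newline
        [.!?]           #   ends with one terminator
    )
    (?=[ ]|[A-Z]|$)     # followed by a space, a capital letter, or the end
    """,
    re.VERBOSE,
)

print(SENTENCE.findall("How are you? i am not fine. Where had you been?"))
# ['How are you?', 'Where had you been?']
```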


Your sentence is validated and captured in group 1.
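
Why group 1 and not the whole match? The leading (?:[.!? ]|^) consumes the preceding character when there is one, so the full match can start with a space. A small sketch with re.finditer shows the difference (the variable names are just for illustration):

```python
import re

pattern = re.compile(r'(?:[.!? ]|^)([A-Z][^.!?\n]*[.!?])(?= |[A-Z]|$)')
text = 'How are you? i am not fine. Where had you been?'

for m in pattern.finditer(text):
    # group(0) is the whole match, including any consumed leading character;
    # group(1) is only the sentence.
    print(repr(m.group(0)), '->', repr(m.group(1)))

# 'How are you?' -> 'How are you?'
# ' Where had you been?' -> 'Where had you been?'
```

re.findall returns group 1 directly when the pattern has exactly one capturing group, which is why the code below prints clean sentences.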

Here is the same thing as Python code:

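import re

# Group 1 of the pattern captures the sentence itself.
pattern = r'(?:[.!? ]|^)([A-Z][^.!?\n]*[.!?])(?= |[A-Z]|$)'

arr = ['Maria has cat etc. my dog.','https://i-am-cat-and-dog/Explain-what-you-are-doing.','Cats Dogs Cars','How are you? i am not fine. Where had you been?']

for s in arr:
    # With a single capturing group, findall returns the group-1 captures.
    print(re.findall(pattern, s))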

which prints:

['Maria has cat etc.']
[]
[]
['How are you?', 'Where had you been?']

The first string yields one sentence and the fourth yields two; the second and third yield none.
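
One caveat: the lookahead above accepts a space followed by any character, which is why the first example is cut at "etc." instead of being kept as one sentence. If you want a terminator to count only when an optional space and a capital letter (or the end of the string) follow it, a possible tightening looks like the sketch below. This is my reading of the rules in the question, not the pattern given above; the name STRICT is made up, and the body is matched lazily so that periods can appear inside a sentence.

```python
import re

# Assumption: a terminator ends a sentence only if it is followed by an
# optional space plus a capital letter, or by the end of the string.
STRICT = re.compile(r'(?:[.!? ]|^)([A-Z][^\n]*?[.!?])(?= ?[A-Z]|$)')

samples = [
    'Maria has cat etc. my dog.',
    'https://i-am-cat-and-dog/Explain-what-you-are-doing.',
    'Cats Dogs Cars',
    'How are you? i am not fine. Where had you been?',
]

for s in samples:
    print(STRICT.findall(s))

# ['Maria has cat etc. my dog.']
# []
# []
# ['How are you? i am not fine.', 'Where had you been?']
```

Note the trade-off: "How are you? i am not fine." is now returned as a single sentence, because the "?" is followed by a lowercase letter.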
