I'm having some trouble correctly identifying the sentences in a text for a few specific corner cases (described below). So far, this is how I identify sentences in a text (source: Subtitles Reformat to end with complete sentence):
The re.findall part basically looks for the parts of the string that start with a capital letter [A-Z], are followed by anything that is not end-of-sentence punctuation, and end with one of the punctuation characters [\.?!].
import re
text = "We were able to respond to the first research question. Next, we also determined the size of the population."
for sentence in re.findall(r'([A-Z][^\.!?]*[\.!?])', text):
    print(sentence + "\n")
We were able to respond to the first research question.

Next, we also determined the size of the population.
Corner case 1: dot, dot, dot
The dot, dot, dot is not preserved, because the regex says nothing about how to handle three consecutive dots. How can I change that?
text = "We were able to respond to the first research question... Next, we also determined the size of the population."
for sentence in re.findall(r'([A-Z][^\.!?]*[\.!?])', text):
    print(sentence + "\n")
We were able to respond to the first research question.

Next, we also determined the size of the population.
Corner case 2: "
The " character is successfully kept inside the sentence, but, just like the extra dots after the punctuation, a closing " that follows the end-of-sentence punctuation is dropped.
text = "We were able to respond to the first \"research\" question: \"What is this?\" Next, we also determined the size of the population."
for sentence in re.findall(r'([A-Z][^\.!?]*[\.!?])', text):
    print(sentence + "\n")
We were able to respond to the first "research" question: "What is this?

Next, we also determined the size of the population.
Corner case 3: sentence starting with a lower-case letter
If a sentence accidentally starts with a lower-case letter, that sentence is ignored. The intention is that, since the previous sentence has ended (or the text has just started), a new sentence must begin there.
text = "We were able to respond to the first research question. next, we also determined the size of the population."
for sentence in re.findall(r'([A-Z][^\.!?]*[\.!?])', text):
    print(sentence + "\n")
We were able to respond to the first research question.
Thank you very much for your help!
Edit:
I tested:
import spacy
from spacy.lang.en import English
raw_text = 'Hello, world. Here are two sentences.'
nlp = English()
doc = nlp(raw_text)
sentences = [sent.string.strip() for sent in doc.sents]
...but I get
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-157-4fd093d3402b> in <module>()
      6 nlp = English()
      7 doc = nlp(raw_text)
----> 8 sentences = [sent.string.strip() for sent in doc.sents]

<ipython-input-157-4fd093d3402b> in <listcomp>(.0)
      6 nlp = English()
      7 doc = nlp(raw_text)
----> 8 sentences = [sent.string.strip() for sent in doc.sents]

doc.pyx in sents()

ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: nlp.add_pipe(nlp.create_pipe('sentencizer')) Alternatively, add the dependency parser, or set sentence boundaries by setting doc[i].is_sent_start.
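For reference, a minimal sketch of the fix the error message itself suggests (assuming the spaCy 2.x API, where nlp.create_pipe is available; in spaCy 3.x the component is added with nlp.add_pipe("sentencizer") instead):

from spacy.lang.en import English

raw_text = 'Hello, world. Here are two sentences.'
nlp = English()
# A bare English() pipeline has no parser, so sentence boundaries are unset;
# the rule-based 'sentencizer' component sets them from punctuation.
nlp.add_pipe(nlp.create_pipe('sentencizer'))
doc = nlp(raw_text)
sentences = [sent.text.strip() for sent in doc.sents]
print(sentences)  # expected: ['Hello, world.', 'Here are two sentences.']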
Answer 0 (score: 2)
You can use an industrial-strength package for this. spaCy, for example, has a very good sentence tokenizer.
from __future__ import unicode_literals, print_function
from spacy.en import English
raw_text = 'Hello, world. Here are two sentences.'
nlp = English()
doc = nlp(raw_text)
sentences = [sent.string.strip() for sent in doc.sents]
Your scenarios:
Case 1 result -> ['We were able to respond to the first research question...', 'Next, we also determined the size of the population.']
Case 2 result -> ['We were able to respond to the first "research" question: "What is this?"', 'Next, we also determined the size of the population.']
Case 3 result -> ['We were able to respond to the first research question.', 'next, we also determined the size of the population.']
Answer 1 (score: 2)
You can modify your regular expression to handle your corner cases.
First, you don't need to escape the . inside a character class [].
For the first corner case, you can greedily match the end-of-sentence tokens with [.!?]*.
For the second one, you can additionally match a " after the [.!?].
For the last one, you can let a sentence start with either an upper- or a lower-case letter:
import re

regex = r'([A-z][^.!?]*[.!?]*"?)'

text = "We were able to respond to the first research question... Next, we also determined the size of the population."
for sentence in re.findall(regex, text):
    print(sentence)
print()

text = "We were able to respond to the first \"research\" question: \"What is this?\" Next, we also determined the size of the population."
for sentence in re.findall(regex, text):
    print(sentence)
print()

text = "We were able to respond to the first research question. next, we also determined the size of the population."
for sentence in re.findall(regex, text):
    print(sentence)
The pattern breaks down as follows:

[A-z], so every match has to start with an upper- or lower-case letter.
[^.?!]*, which greedily matches any character that is not ., ? or ! (the end-of-sentence characters).
[.?!]*, which greedily matches the end-of-sentence characters themselves, so something like ...??!!??? is kept as part of the sentence.
"?, which finally matches a quotation mark at the very end of the sentence.

Case 1:

We were able to respond to the first research question...
Next, we also determined the size of the population.

Case 2:

We were able to respond to the first "research" question: "What is this?"
Next, we also determined the size of the population.

Case 3:

We were able to respond to the first research question.
next, we also determined the size of the population.
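One caveat not stated in the answer: the class [A-z] covers the whole ASCII range between 'A' and 'z', which also includes characters such as [, \, ], ^, _ and the backtick. A sketch of a slightly stricter variant using [A-Za-z], assuming only letters should be allowed to start a sentence:

import re

# stricter variant: only actual letters may start a sentence
regex = r'([A-Za-z][^.!?]*[.!?]*"?)'
text = "We were able to respond to the first research question. next, we also determined the size of the population."
print(re.findall(regex, text))
# expected: ['We were able to respond to the first research question.',
#            'next, we also determined the size of the population.']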
Answer 2 (score: 1)
You can use nltk's sent_tokenize. That will save you a lot of trouble.
from nltk import sent_tokenize

# Corner Case 1: Dot, Dot, Dot
text_dot_dot_dot = "We were able to respond to the first research question... Next, we also determined the size of the population."
print("Corner Case 1: ", sent_tokenize(text_dot_dot_dot))

# Corner Case 2: "
text_ = "We were able to respond to the first \"research\" question: \"What is this?\" Next, we also determined the size of the population."
print("Corner Case 2: ", sent_tokenize(text_))

# Corner Case 3: lower case
text_lower = "We were able to respond to the first research question. next, we also determined the size of the population."
print("Corner Case 3: ", sent_tokenize(text_lower))
Result:
Corner Case 1: ['We were able to respond to the first research question... Next, we also determined the size of the population.']
Corner Case 2: ['We were able to respond to the first "research" question: "What is this?"', 'Next, we also determined the size of the population.']
Corner Case 3: ['We were able to respond to the first research question.', 'next, we also determined the size of the population.']
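One note not mentioned in the answer: sent_tokenize relies on NLTK's pre-trained Punkt models, so a one-time download may be required before the snippet above runs:

import nltk

nltk.download('punkt')  # one-time download of the Punkt sentence tokenizer models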
Answer 3 (score: 0)
Try the following regex: ([A-Z][^.!?]*[.!?]+["]?)
'+' means one or more
'?' means zero or one
This should handle corner cases 1 and 2 above; for corner case 3 the first letter would still need to be allowed to be lower case (e.g. [A-Za-z] instead of [A-Z]).
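A minimal sketch (not part of the original answer) applying this pattern to the corner-case texts from the question; the variable names are just illustrative:

import re

pattern = r'([A-Z][^.!?]*[.!?]+["]?)'

ellipsis_text = "We were able to respond to the first research question... Next, we also determined the size of the population."
quote_text = "We were able to respond to the first \"research\" question: \"What is this?\" Next, we also determined the size of the population."

# the [.!?]+ keeps the "..." as part of the first sentence
print(re.findall(pattern, ellipsis_text))
# the ["]? keeps the closing quote after the "?"
print(re.findall(pattern, quote_text))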