
Time: 2019-05-17 07:52:32

Tags: python regex python-3.x string


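  1. If an ellipsis (dot, dot, dot) is involved, it is not preserved.
  2. If a double quote (") is involved, the closing quote after the terminator is dropped.
  3. If a sentence accidentally starts with a lowercase letter, it is not matched at all.

So far, this is how I identify sentences in a text (source: Subtitles Reformat to end with complete sentence):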


import re
text = "We were able to respond to the first research question. Next, we also determined the size of the population."
for sentence in re.findall(r'([A-Z][^\.!?]*[\.!?])', text):
    print(sentence + "\n")
We were able to respond to the first research question.

Next, we also determined the size of the population.



text = "We were able to respond to the first research question... Next, we also determined the size of the population."
for sentence in re.findall(r'([A-Z][^\.!?]*[\.!?])', text):
    print(sentence + "\n")
We were able to respond to the first research question.

Next, we also determined the size of the population.



text = "We were able to respond to the first \"research\" question: \"What is this?\" Next, we also determined the size of the population."
for sentence in re.findall(r'([A-Z][^\.!?]*[\.!?])', text):
    print(sentence + "\n")
We were able to respond to the first "research" question: "What is this?

Next, we also determined the size of the population.



text = "We were able to respond to the first research question. next, we also determined the size of the population."
for sentence in re.findall(r'([A-Z][^\.!?]*[\.!?])', text):
    print(sentence + "\n")

We were able to respond to the first research question.




import spacy
from spacy.lang.en import English

raw_text = 'Hello, world. Here are two sentences.'
nlp = English()
doc = nlp(raw_text)
sentences = [sent.string.strip() for sent in doc.sents]


ValueError                                Traceback (most recent call last)
<ipython-input-157-4fd093d3402b> in <module>()
      6 nlp = English()
      7 doc = nlp(raw_text)
----> 8 sentences = [sent.string.strip() for sent in doc.sents]

<ipython-input-157-4fd093d3402b> in <listcomp>(.0)
      6 nlp = English()
      7 doc = nlp(raw_text)
----> 8 sentences = [sent.string.strip() for sent in doc.sents]

doc.pyx in sents()

ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with:

nlp.add_pipe(nlp.create_pipe('sentencizer')) Alternatively, add the dependency parser, or set sentence boundaries by setting doc[i].is_sent_start.

4 answers:

Answer 0 (score: 2)


from __future__ import unicode_literals, print_function
from spacy.en import English  # spaCy 1.x import path; spaCy 2+ uses spacy.lang.en

raw_text = 'Hello, world. Here are two sentences.'
nlp = English()
doc = nlp(raw_text)
sentences = [sent.string.strip() for sent in doc.sents]


  1. Case result -> ['We were able to respond to the first research question...', 'Next, we also determined the size of the population.']

  2. Case result -> ['We were able to respond to the first "research" question: "What is this?"', 'Next, we also determined the size of the population.']

  3. Case result -> ['We were able to respond to the first research question.', 'next, we also determined the size of the population.']

Answer 1 (score: 2)






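import re

# Letter start, body without terminators, run of terminators, optional closing quote.
regex = r'([A-Za-z][^.!?]*[.!?]*"?)'

text = "We were able to respond to the first research question... Next, we also determined the size of the population."
for sentence in re.findall(regex, text):
    print(sentence)

text = "We were able to respond to the first \"research\" question: \"What is this?\" Next, we also determined the size of the population."
for sentence in re.findall(regex, text):
    print(sentence)

text = "We were able to respond to the first research question. next, we also determined the size of the population."
for sentence in re.findall(regex, text):
    print(sentence)


  • [A-Za-z]: every match must start with a letter, uppercase or lowercase. (The pattern as posted used [A-z], which is a wider range than the letters alone.)
  • [^.!?]*: greedily matches any characters that are not ., ! or ? (the sentence-ending characters).
  • [.!?]*: greedily matches the ending characters, so a run such as ...??!!??? is matched as part of the sentence.
  • "?: finally matches an optional closing quote at the end of the sentence.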
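As a side note on the first bullet: the range `[A-z]` runs over ASCII codes 65 to 122, so it also covers the six characters that sit between `Z` and `a`. The check below is a minimal sketch assuming plain ASCII input; the names `a_to_z`, `letters_only`, `tail` and `sample` are made up for this illustration.

```python
import re

# Hypothetical names for illustration: two ways of writing "starts with a letter".
a_to_z = re.compile(r'[A-z]')           # ASCII 65..122, includes [ \ ] ^ _ `
letters_only = re.compile(r'[A-Za-z]')  # ASCII letters only

between = '[\\]^_`'  # the six characters between 'Z' and 'a' in ASCII

matched_by_range = a_to_z.findall(between)
matched_by_letters = letters_only.findall(between)
print(matched_by_range)
print(matched_by_letters)

# Effect on sentence matching: a leading '[' can start a match with [A-z].
tail = r'[^.!?]*[.!?]*"?'
sample = '[1] Intro. Next.'
with_range = re.findall(r'[A-z]' + tail, sample)
with_letters = re.findall(r'[A-Za-z]' + tail, sample)
print(with_range)
print(with_letters)
```

Whether a sentence should be allowed to begin with a bracket or underscore depends on the input; the point is only that the two classes are not equivalent.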



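We were able to respond to the first research question...
Next, we also determined the size of the population.



We were able to respond to the first "research" question: "What is this?"
Next, we also determined the size of the population.



We were able to respond to the first research question.
next, we also determined the size of the population.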

Answer 2 (score: 1)

You can use nltk's sent_tokenize. It saves a lot of trouble.

from nltk import sent_tokenize  # needs NLTK's Punkt tokenizer data to be installed

# Corner case 1: dot, dot, dot
text_dot_dot_dot = "We were able to respond to the first research question... Next, we also determined the size of the population."
print("Corner Case 1: ", sent_tokenize(text_dot_dot_dot))
# Corner case 2: double quote
text_ = "We were able to respond to the first \"research\" question: \"What is this?\" Next, we also determined the size of the population."
print("Corner Case 2: ", sent_tokenize(text_))
# Corner case 3: lowercase start
text_lower = "We were able to respond to the first research question. next, we also determined the size of the population."
print("Corner Case 3: ", sent_tokenize(text_lower))


Corner Case 1:  ['We were able to respond to the first research question... Next, we also determined the size of the population.']
Corner Case 2:  ['We were able to respond to the first "research" question: "What is this?"', 'Next, we also determined the size of the population.']
Corner Case 3:  ['We were able to respond to the first research question.', 'next, we also determined the size of the population.']

Answer 3 (score: 0)

Try the following regex: ([A-Z][^.!?]*[.!?]+["]?)
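One quick way to see what this pattern does with the three corner cases from the question is to run it through `re.findall`. This is only a sketch: the `cases` dictionary, its keys, and the `results` name are made up here for illustration.

```python
import re

# Pattern from the answer, written with ASCII punctuation.
pattern = re.compile(r'([A-Z][^.!?]*[.!?]+["]?)')

# Hypothetical labels; the texts are the three corner cases from the question.
cases = {
    "ellipsis": "We were able to respond to the first research question... "
                "Next, we also determined the size of the population.",
    "quote": 'We were able to respond to the first "research" question: '
             '"What is this?" Next, we also determined the size of the population.',
    "lowercase": "We were able to respond to the first research question. "
                 "next, we also determined the size of the population.",
}

# Collect the matched sentences for each case.
results = {name: pattern.findall(text) for name, text in cases.items()}
for name, sentences in results.items():
    print(name, sentences)
```

With these inputs the ellipsis and the closing quote stay attached to their sentences, but the sentence that starts with lowercase `next` is not returned, because the pattern still requires an uppercase first letter.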


