I'm having some trouble correctly identifying the sentences in a text for a few specific corner cases (described below). So far, this is how I identify sentences in a text (source: Subtitles Reformat to end with complete sentence):
The re.findall part basically looks for the parts of the string that start with a capital letter [A-Z], are followed by anything that is not end-of-sentence punctuation, and end with one of the punctuation characters [\.?!].
import re
text = "We were able to respond to the first research question. Next, we also determined the size of the population."
for sentence in re.findall(r'([A-Z][^\.!?]*[\.!?])', text):
    print(sentence + "\n")
We were able to respond to the first research question.

Next, we also determined the size of the population.
Corner case 1: dot, dot, dot
The dot, dot, dot is not preserved, because the regex says nothing about how to handle three consecutive dots. How can I change that?
text = "We were able to respond to the first research question... Next, we also determined the size of the population."
for sentence in re.findall(r'([A-Z][^\.!?]*[\.!?])', text):
    print(sentence + "\n")
We were able to respond to the first research question.

Next, we also determined the size of the population.
Corner case 2: "
The " character is successfully kept inside the sentence, but, just like the extra dots after the punctuation, a closing " that follows the end-of-sentence punctuation is dropped.
text = "We were able to respond to the first \"research\" question: \"What is this?\" Next, we also determined the size of the population."
for sentence in re.findall(r'([A-Z][^\.!?]*[\.!?])', text):
    print(sentence + "\n")
We were able to respond to the first "research" question: "What is this?

Next, we also determined the size of the population.
Corner case 3: sentence starting with a lower-case letter
If a sentence accidentally starts with a lower-case letter, that sentence is ignored. The intention is that, since the previous sentence has ended (or the text has just started), a new sentence must begin there.
text = "We were able to respond to the first research question. next, we also determined the size of the population."
for sentence in re.findall(r'([A-Z][^\.!?]*[\.!?])', text):
    print(sentence + "\n")
We were able to respond to the first research question.
Thank you very much for your help!
Edit:
I tested:
import spacy
from spacy.lang.en import English
raw_text = 'Hello, world. Here are two sentences.'
nlp = English()
doc = nlp(raw_text)
sentences = [sent.string.strip() for sent in doc.sents]
...but I get
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-157-4fd093d3402b> in <module>()
      6 nlp = English()
      7 doc = nlp(raw_text)
----> 8 sentences = [sent.string.strip() for sent in doc.sents]

<ipython-input-157-4fd093d3402b> in <listcomp>(.0)
      6 nlp = English()
      7 doc = nlp(raw_text)
----> 8 sentences = [sent.string.strip() for sent in doc.sents]

doc.pyx in sents()

ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: nlp.add_pipe(nlp.create_pipe('sentencizer')) Alternatively, add the dependency parser, or set sentence boundaries by setting doc[i].is_sent_start.
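For reference, a minimal sketch of the fix the error message itself suggests (assuming the spaCy 2.x API, where nlp.create_pipe is available; in spaCy 3.x the component is added with nlp.add_pipe("sentencizer") instead):

from spacy.lang.en import English

raw_text = 'Hello, world. Here are two sentences.'
nlp = English()
# A bare English() pipeline has no parser, so sentence boundaries are unset;
# the rule-based 'sentencizer' component sets them from punctuation.
nlp.add_pipe(nlp.create_pipe('sentencizer'))
doc = nlp(raw_text)
sentences = [sent.text.strip() for sent in doc.sents]
print(sentences)  # expected: ['Hello, world.', 'Here are two sentences.']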
Answer 0 (score: 2)
You can use an industrial-strength package for this. spaCy, for example, has a very good sentence tokenizer.
from __future__ import unicode_literals, print_function
from spacy.en import English
raw_text = 'Hello, world. Here are two sentences.'
nlp = English()
doc = nlp(raw_text)
sentences = [sent.string.strip() for sent in doc.sents]
Your scenarios:
Case 1 result -> ['We were able to respond to the first research question...', 'Next, we also determined the size of the population.']
Case 2 result -> ['We were able to respond to the first "research" question: "What is this?"', 'Next, we also determined the size of the population.']
Case 3 result -> ['We were able to respond to the first research question.', 'next, we also determined the size of the population.']
Answer 1 (score: 2)
You can modify your regular expression to handle your corner cases.
First, you don't need to escape the . inside a character class [].
For the first corner case, you can greedily match the end-of-sentence tokens with [.!?]*.
For the second one, you can additionally match a " after the [.!?].
For the last one, you can let a sentence start with either an upper- or a lower-case letter:
import re

regex = r'([A-z][^.!?]*[.!?]*"?)'

text = "We were able to respond to the first research question... Next, we also determined the size of the population."
for sentence in re.findall(regex, text):
    print(sentence)
print()

text = "We were able to respond to the first \"research\" question: \"What is this?\" Next, we also determined the size of the population."
for sentence in re.findall(regex, text):
    print(sentence)
print()

text = "We were able to respond to the first research question. next, we also determined the size of the population."
for sentence in re.findall(regex, text):
    print(sentence)
The pattern breaks down as follows:

[A-z], so every match has to start with an upper- or lower-case letter.
[^.?!]*, which greedily matches any character that is not ., ? or ! (the end-of-sentence characters).
[.?!]*, which greedily matches the end-of-sentence characters themselves, so something like ...??!!??? is kept as part of the sentence.
"?, which finally matches a quotation mark at the very end of the sentence.

Case 1:

We were able to respond to the first research question...
Next, we also determined the size of the population.

Case 2:

We were able to respond to the first "research" question: "What is this?"
Next, we also determined the size of the population.

Case 3:

We were able to respond to the first research question.
next, we also determined the size of the population.
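One caveat not stated in the answer: the class [A-z] covers the whole ASCII range between 'A' and 'z', which also includes characters such as [, \, ], ^, _ and the backtick. A sketch of a slightly stricter variant using [A-Za-z], assuming only letters should be allowed to start a sentence:

import re

# stricter variant: only actual letters may start a sentence
regex = r'([A-Za-z][^.!?]*[.!?]*"?)'
text = "We were able to respond to the first research question. next, we also determined the size of the population."
print(re.findall(regex, text))
# expected: ['We were able to respond to the first research question.',
#            'next, we also determined the size of the population.']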
Answer 2 (score: 1)
You can use nltk's sent_tokenize. That will save you a lot of trouble.
from nltk import sent_tokenize

# Corner Case 1: Dot, Dot, Dot
text_dot_dot_dot = "We were able to respond to the first research question... Next, we also determined the size of the population."
print("Corner Case 1: ", sent_tokenize(text_dot_dot_dot))

# Corner Case 2: "
text_ = "We were able to respond to the first \"research\" question: \"What is this?\" Next, we also determined the size of the population."
print("Corner Case 2: ", sent_tokenize(text_))

# Corner Case 3: lower case
text_lower = "We were able to respond to the first research question. next, we also determined the size of the population."
print("Corner Case 3: ", sent_tokenize(text_lower))
Result:
Corner Case 1: ['We were able to respond to the first research question... Next, we also determined the size of the population.']
Corner Case 2: ['We were able to respond to the first "research" question: "What is this?"', 'Next, we also determined the size of the population.']
Corner Case 3: ['We were able to respond to the first research question.', 'next, we also determined the size of the population.']
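One note not mentioned in the answer: sent_tokenize relies on NLTK's pre-trained Punkt models, so a one-time download may be required before the snippet above runs:

import nltk

nltk.download('punkt')  # one-time download of the Punkt sentence tokenizer models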
Answer 3 (score: 0)
Try the following regex: ([A-Z][^.!?]*[.!?]+["]?)
'+' means one or more
'?' means zero or one
This should handle corner cases 1 and 2 above; for corner case 3 the first letter would still need to be allowed to be lower case (e.g. [A-Za-z] instead of [A-Z]).
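A minimal sketch (not part of the original answer) applying this pattern to the corner-case texts from the question; the variable names are just illustrative:

import re

pattern = r'([A-Z][^.!?]*[.!?]+["]?)'

ellipsis_text = "We were able to respond to the first research question... Next, we also determined the size of the population."
quote_text = "We were able to respond to the first \"research\" question: \"What is this?\" Next, we also determined the size of the population."

# the [.!?]+ keeps the "..." as part of the first sentence
print(re.findall(pattern, ellipsis_text))
# the ["]? keeps the closing quote after the "?"
print(re.findall(pattern, quote_text))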