Let's say I have the following paragraph:
"This is the first sentence. This is the second sentence? This is the third
sentence!"
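I need to create a function that returns only as many whole sentences as fit within a given number of characters. If the limit is shorter than the first sentence, it should return the first sentence cut off at that many characters.
For example: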
>>> reduce_paragraph(100)
"This is the first sentence. This is the second sentence? This is the third
sentence!"
>>> reduce_paragraph(80)
"This is the first sentence. This is the second sentence?"
>>> reduce_paragraph(50)
"This is the first sentence."
>>> reduce_paragraph(5)
"This "
I started with something like this, but I can't seem to figure out how to finish it:
endsentence = ".?!"
sentences = itertools.groupby(text, lambda x: any(x.endswith(punct) for punct in endsentence))
for number,(truth, sentence) in enumerate(sentences):
    if truth:
        first_sentence = previous+''.join(sentence).replace('\n',' ')
    previous = ''.join(sentence)
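Answer 0 (score: 6)
Working with sentences is very hard because of the syntactic structure of English. As someone has already pointed out, things like abbreviations cause endless trouble even for the best regular expression.
You should consider the Natural Language Toolkit (NLTK), in particular its punkt module. It is a sentence tokenizer that will do the heavy lifting for you.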
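To illustrate the abbreviation problem, here is a minimal sketch using only the standard library. The sample text and the splitting pattern are my own assumptions, not something from the question; the point is only that a naive "split after a terminator" rule breaks a two-sentence text into four pieces.

```python
import re

# Hypothetical sample text: two real sentences, but several periods
# that belong to abbreviations rather than sentence ends.
sample = "Dr. Smith arrived at 5 p.m. on Friday. He was late."

# Naive rule: split on whitespace that follows ".", "?" or "!".
pieces = re.split(r"(?<=[.?!])\s+", sample)

for piece in pieces:
    print(repr(piece))
# 'Dr.'
# 'Smith arrived at 5 p.m.'
# 'on Friday.'
# 'He was late.'
```

A trained tokenizer such as punkt is meant to handle this kind of case, which is why it is worth the extra dependency.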
Answer 1 (score: 2)
Here is how to truncate a paragraph using the punkt module mentioned by @BigHandsome:
from nltk.tokenize.punkt import PunktSentenceTokenizer

def truncate_paragraph(text, maxnchars,
                       tokenize=PunktSentenceTokenizer().span_tokenize):
    """Truncate the text to at most maxnchars number of characters.

    The result contains only full sentences unless maxnchars is less
    than the first sentence length.
    """
    sentence_boundaries = tokenize(text)
    last = None
    for start_unused, end in sentence_boundaries:
        if end > maxnchars:
            break
        last = end
    return text[:last] if last is not None else text[:maxnchars]

text = ("This is the first sentence. This is the second sentence? "
        "This is the third\n sentence!")
for limit in [100, 80, 50, 5]:
    print(truncate_paragraph(text, limit))
Output:
This is the first sentence. This is the second sentence? This is the third
 sentence!
This is the first sentence. This is the second sentence?
This is the first sentence.
This 
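Because the tokenizer is passed in as the tokenize parameter, any callable that yields (start, end) spans can be swapped in. The sketch below shows the same span-based truncation with a crude regex stand-in so it runs without NLTK. The names regex_span_tokenize and truncate_by_spans are hypothetical, and the regex is only an assumption for demonstration; it has none of punkt's handling of abbreviations.

```python
import re

def regex_span_tokenize(text):
    """Hypothetical stand-in for span_tokenize: yield (start, end) per sentence."""
    # A sentence starts at a non-space, non-terminator character and runs
    # through the next run of ".", "?" or "!".
    for match in re.finditer(r"[^.?!\s][^.?!]*[.?!]+", text):
        yield match.span()

def truncate_by_spans(text, maxnchars, tokenize=regex_span_tokenize):
    """Keep whole sentences whose end offset fits within maxnchars."""
    last = None
    for _start, end in tokenize(text):
        if end > maxnchars:
            break
        last = end
    # No sentence fits: fall back to a hard cut.
    return text[:last] if last is not None else text[:maxnchars]

text = ("This is the first sentence. This is the second sentence? "
        "This is the third\n sentence!")
spans = list(regex_span_tokenize(text))
print(spans)  # [(0, 27), (28, 56), (57, 85)]
```

Only the end offsets matter for truncation, which is why the loop ignores each span's start.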
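Answer 2 (score: 0)
If we ignore the natural-language issues (that is, we only want an algorithm that returns the complete chunks delimited by ".?!" whose total length stays within k), then the following basic approach will work: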
def sentences_upto(paragraph, k):
    sentences = []
    current_sentence = ""
    stop_chars = ".?!"
    # Scan only the first k characters, so every sentence returned
    # ends within the limit.
    for c in paragraph[:k]:
        current_sentence += c
        if c in stop_chars:
            # A terminator closes the current sentence.
            sentences.append(current_sentence)
            current_sentence = ""
    return sentences
Your itertools solution can be completed like this:
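import itertools

def sentences_upto_2(paragraph, size):
    stop_chars = ".?!"
    # Group consecutive characters by whether they are sentence terminators.
    sentences = itertools.groupby(paragraph, lambda x: any(x.endswith(punct) for punct in stop_chars))
    for k, s in sentences:
        ss = "".join(s)
        size -= len(ss)
        # Only the text between terminators is yielded; the terminator
        # runs themselves are counted against size but skipped.
        if not k:
            if size < 0:
                return
            yield ss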
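If the grouping is unclear, this small sketch shows what itertools.groupby produces when the key is "is this character a terminator". The sample string is made up for illustration.

```python
import itertools

stop_chars = ".?!"
sample = "Hi. Ok?"  # hypothetical input, chosen to keep the output short

# groupby starts a new group every time the key changes, so runs of
# ordinary characters alternate with runs of terminators.
groups = [(is_stop, "".join(chars))
          for is_stop, chars in itertools.groupby(sample, lambda c: c in stop_chars)]
print(groups)
# [(False, 'Hi'), (True, '.'), (False, ' Ok'), (True, '?')]
```

Each False group is a sentence body and the True group after it is that sentence's terminator, so the two need to be joined if you want the punctuation back.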
Answer 3 (score: 0)
You can break this problem down into simpler steps, as in the following sample code (untested):
import re

def reduce_paragraph(para, max_len):
    # Split into list of sentences
    # A sentence is a sequence of characters ending with ".", "?", or "!".
    # Splitting on this zero-width lookbehind requires Python 3.7+.
    sentences = re.split(r"(?<=[.?!])", para)

    # Figure out how many sentences we can have and stay under max_len
    num_sentences = 0
    total_len = 0
    for s in sentences:
        total_len += len(s)
        if total_len > max_len:
            break
        num_sentences += 1

    if num_sentences > 0:
        # We can fit at least one sentence, so return whole sentences
        return ''.join(sentences[:num_sentences])
    else:
        # Return a truncated first sentence
        return sentences[0][:max_len]
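As a cross-check, here is a sketch of the same steps that uses re.findall instead of a zero-width split, so it does not depend on how re.split treats empty matches. The name reduce_paragraph_findall is hypothetical, and the paragraph is the one from the question written on a single line.

```python
import re

def reduce_paragraph_findall(para, max_len):
    # Each piece is a run of non-terminators plus any terminators after it.
    sentences = re.findall(r"[^.?!]+[.?!]*", para)
    result = ""
    for s in sentences:
        if len(result) + len(s) > max_len:
            break
        result += s
    # Fall back to a hard cut when not even the first sentence fits.
    return result if result else para[:max_len]

para = ("This is the first sentence. This is the second sentence? "
        "This is the third sentence!")
for limit in [100, 80, 50, 5]:
    print(repr(reduce_paragraph_findall(para, limit)))
```

For the four limits in the question this gives the whole paragraph, the first two sentences, the first sentence, and "This ".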