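Function that returns the sentences that fit under a given number of characters

Time: 2012-08-19 22:20:48

Tags: python

Let's suppose I have the following paragraph: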

"This is the first sentence. This is the second sentence? This is the third
 sentence!"

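I need to create a function that returns only the sentences that fit under a given number of characters. If the limit is shorter than one sentence, it should return the characters of the first sentence up to that limit.

For example: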

>>> reduce_paragraph(100)
"This is the first sentence. This is the second sentence? This is the third
 sentence!"

>>> reduce_paragraph(80)
"This is the first sentence. This is the second sentence?"

>>> reduce_paragraph(50)
"This is the first sentence."

>>> reduce_paragraph(5)
"This "

I started with something like this, but I can't seem to figure out how to finish it:

endsentence = ".?!"
sentences = itertools.groupby(text, lambda x: any(x.endswith(punct) for punct in endsentence))
for number,(truth, sentence) in enumerate(sentences):
    if truth:
        first_sentence = previous+''.join(sentence).replace('\n',' ')
    previous = ''.join(sentence)

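4 Answers:

Answer 0: (score: 6)

Processing sentences is very difficult because of the syntactic constructions of English. As someone has already pointed out, issues such as abbreviations will cause unending headaches even for the best regular expression.

You should consider the Natural Language Toolkit, and in particular its punkt module. It is a sentence tokenizer, and it will do the heavy lifting for you.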
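
To illustrate the abbreviation problem, here is a minimal sketch using only the standard library. The helper name `naive_split` and the sample text are made up for this illustration; the point is only that a simple regex treats the period in "Dr." as a sentence boundary.

```python
import re

def naive_split(text):
    # Hypothetical helper: split after '.', '?' or '!' when whitespace follows.
    return re.split(r"(?<=[.?!])\s+", text)

print(naive_split("Dr. Smith arrived. He sat down."))
# ['Dr.', 'Smith arrived.', 'He sat down.']
```

The first "sentence" is just the abbreviation, which is the kind of case a trained tokenizer such as punkt is meant to handle.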

Answer 1: (score: 2)

Here is how you could use the punkt module mentioned by @BigHandsome to truncate the paragraph:

from nltk.tokenize.punkt import PunktSentenceTokenizer

def truncate_paragraph(text, maxnchars,
                       tokenize=PunktSentenceTokenizer().span_tokenize):
    """Truncate the text to at most maxnchars number of characters.

    The result contains only full sentences unless maxnchars is less
    than the first sentence length.
    """
    sentence_boundaries = tokenize(text)
    last = None
    for start_unused, end in sentence_boundaries:
        if end > maxnchars:
            break
        last = end
    return text[:last] if last is not None else text[:maxnchars]

Example

text = ("This is the first sentence. This is the second sentence? "
        "This is the third\n sentence!")
for limit in [100, 80, 50, 5]:
    print(truncate_paragraph(text, limit))

Output

This is the first sentence. This is the second sentence? This is the third
 sentence!
This is the first sentence. This is the second sentence?
This is the first sentence.
This 
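
Because the tokenizer is passed in as a parameter, the same truncation loop can be tried without NLTK installed. The following is a rough sketch under that assumption: `regex_span_tokenize` is a hypothetical stand-in written for this example, not part of NLTK, and it does not handle abbreviations the way punkt does.

```python
import re

def regex_span_tokenize(text):
    """Yield (start, end) spans of chunks ending in '.', '?' or '!'.

    Hypothetical stand-in for PunktSentenceTokenizer().span_tokenize.
    Trailing text without a terminator is yielded as a final span.
    """
    for match in re.finditer(r"[^.?!]+(?:[.?!]+|$)", text):
        yield match.span()

def truncate_paragraph(text, maxnchars, tokenize=regex_span_tokenize):
    # Same loop as above: remember the end of the last sentence that fits.
    last = None
    for _start, end in tokenize(text):
        if end > maxnchars:
            break
        last = end
    return text[:last] if last is not None else text[:maxnchars]

text = ("This is the first sentence. This is the second sentence? "
        "This is the third\n sentence!")
for limit in [100, 80, 50, 5]:
    print(repr(truncate_paragraph(text, limit)))
```

Only the `end` offset of each span is used, so it does not matter here that the regex spans include the whitespace before each sentence.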

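Answer 2: (score: 0)

If we ignore the natural-language issues (i.e. we only want an algorithm that returns the complete chunks delimited by ".?!" that fit within k characters), then the following basic approach will work: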

def sentences_upto(paragraph, k):
    """Return the complete sentences that fit in the first k characters."""
    sentences = []
    current_sentence = ""
    stop_chars = ".?!"
    # Only scan the first k characters; a sentence is kept only if its
    # terminator falls inside that window.
    for c in paragraph[:k]:
        current_sentence += c
        if c in stop_chars:
            sentences.append(current_sentence)
            current_sentence = ""
    return sentences

Your itertools solution could be completed like this:

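import itertools

def sentences_upto_2(paragraph, size):
    stop_chars = ".?!"
    # groupby alternates between runs of ordinary characters (key False)
    # and runs of sentence terminators (key True).
    groups = itertools.groupby(paragraph, lambda c: c in stop_chars)
    body = ""
    for is_stop, chars in groups:
        chunk = "".join(chars)
        size -= len(chunk)
        if size < 0:
            return
        if is_stop:
            # Yield the sentence together with its terminator.
            yield body + chunk
            body = ""
        else:
            body = chunk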

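Answer 3: (score: 0)

You can break this problem down into simpler steps:

  1. Given a paragraph, split it into sentences.
  2. Figure out how many sentences we can join while staying under the character limit.
  3. If we can fit at least one sentence, join those sentences together.
  4. If the first sentence is too long, take the first sentence and truncate it.
  5. Sample code (untested):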

        import re

        def reduce_paragraph(para, max_len):
            # Split into list of sentences
            # A sentence is a sequence of characters ending with ".", "?", or "!".
            # Note: splitting on a zero-width lookbehind requires Python 3.7+.
            sentences = re.split(r"(?<=[.?!])", para)
    
            # Figure out how many sentences we can have and stay under max_len
            num_sentences = 0
            total_len = 0
            for s in sentences:
                total_len += len(s)
                if total_len > max_len:
                    break
                num_sentences += 1
    
            if num_sentences > 0:
                # We can fit at least one sentence, so return whole sentences
                return ''.join(sentences[:num_sentences])
            else:
                # Return a truncated first sentence
                return sentences[0][:max_len]
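
On interpreters older than Python 3.7, `re.split` does not split on a pattern that matches only the empty string, so the split step above returns the paragraph unsplit. A possible variant of the same steps that avoids that dependency is sketched below with `re.findall`; the name `reduce_paragraph_findall` is hypothetical, and the sketch still ignores abbreviations and other natural-language issues.

```python
import re

def reduce_paragraph_findall(para, max_len):
    # Step 1: each match is a run of non-terminators plus one terminator;
    # the second alternative picks up trailing text with no terminator.
    sentences = re.findall(r"[^.?!]*[.?!]|[^.?!]+$", para)
    if not sentences:
        return para[:max_len]

    # Step 2: keep whole sentences while the running length fits.
    kept = []
    total_len = 0
    for s in sentences:
        total_len += len(s)
        if total_len > max_len:
            break
        kept.append(s)

    # Steps 3 and 4: join what fits, or truncate the first sentence.
    return "".join(kept) if kept else sentences[0][:max_len]

para = ("This is the first sentence. This is the second sentence? "
        "This is the third\n sentence!")
for limit in [100, 80, 50, 5]:
    print(repr(reduce_paragraph_findall(para, limit)))
```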