NLTK sentence boundary error

Date: 2016-05-18 12:01:52

Tags: python python-3.x nlp nltk

I am reading Chapter 6 of the book "Natural Language Processing with Python" (http://www.nltk.org/book/ch06.html).

I am trying to replicate the sentence segmentation experiment with the cess_esp corpus. I followed the code line by line and it seems to work, until I try to use it to segment my own text.

>>> import nltk
>>> from nltk.corpus import cess_esp
>>> sentences = cess_esp.sents()
>>> tokens = []
>>> boundaries = set()
>>> offset = 0
>>> for sent in sentences:
        tokens.extend(sent)
        offset += len(sent)
        boundaries.add(offset-1)


>>> def punct_features(tokens, i):
        return {'next-word-capitalized': tokens[i+1][0].isupper(),
                'prev-word': tokens[i-1].lower(),
                'punct': tokens[i],
                'prev-word-is-one-char': len(tokens[i-1]) == 1}

>>> featureset = [(punct_features(tokens, i), (i in boundaries))
              for i in range(1, len(tokens)-1)
              if tokens[i] in '.?!']
>>> size = int(len(featureset) * 0.1)
>>> train_set, test_set = featureset[size:], featureset[:size]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> nltk.classify.accuracy(classifier, test_set)
0.9983388704318937

So far so good. But when I try to use the function to segment my own text, I get an error.

def segment_sentences(words):
    start = 0
    sents = []
    for i, word in enumerate(words):
        if word in '.?!' and classifier.classify(punct_features(words, i)) == True:
            sents.append(words[start:i+1])
            start = i+1
    if start < len(words):
        sents.append(words[start:])
    return sents
  

new_text = ['En', 'un', 'lugar', 'de', 'la', 'Mancha', ',', 'de', 'cuyo',
            'nombre', 'no', 'quiero', 'acordarme', ',', 'no', 'ha', 'mucho',
            'tiempo', 'que', 'vivía', 'un', 'hidalgo', 'de', 'los', 'de',
            'lanza', 'en', 'astillero', ',', 'adarga', 'antigua', ',', 'rocín',
            'flaco', 'y', 'galgo', 'corredor', '.', 'Una', 'olla', 'de', 'algo',
            'más', 'vaca', 'que', 'carnero', ',', 'salpicón', 'las', 'más',
            'noches', ',', 'duelos', 'y', 'quebrantos', 'los', 'sábados', ',',
            'lantejas', 'los', 'viernes', ',', 'algún', 'palomino', 'de',
            'añadidura', 'los', 'domingos', ',', 'consumían', 'las', 'tres',
            'partes', 'de', 'su', 'hacienda', '.', 'El', 'resto', 'della',
            'concluían', 'sayo', 'de', 'velarte', ',', 'calzas', 'de', 'velludo',
            'para', 'las', 'fiestas', ',', 'con', 'sus', 'pantuflos', 'de', 'lo',
            'mesmo', ',', 'y', 'los', 'días', 'de', 'entresemana', 'se',
            'honraba', 'con', 'su', 'vellorí', 'de', 'lo', 'más', 'fino', '.']

segment_sentences(new_text)
Traceback (most recent call last):
  File "<pyshell#31>", line 1, in <module>
    segment_sentences(new_text)
  File "<pyshell#26>", line 5, in segment_sentences
    if word in '.?!' and classifier.classify(punct_features(words, i)) == True:
  File "<pyshell#16>", line 2, in punct_features
    return {'next-word-capitalized': tokens[i+1][0].isupper(),
IndexError: list index out of range

I have been tweaking some of the numbers to see if I could fix the index out of range error, but it doesn't work.

Any help is appreciated.

1 Answer:

Answer 0 (score: 2)

It looks like you need to loop over enumerate(words[:-1]) instead of enumerate(words).

As you have it written, you call punct_features(words, i) on the last word in the list. When the index (i) of that last word is passed to punct_features(), it tries to access tokens[i+1], which is one past the end of words, so you get an IndexError.
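
For reference, this is the same function with just that change applied (a minimal sketch; everything else is left exactly as written in the question, and the final if still collects the tokens after the last detected boundary, so the closing sentence is not lost):

def segment_sentences(words):
    start = 0
    sents = []
    # Stop one token early: punct_features() looks ahead at tokens[i+1],
    # so the final token must never be handed to the classifier.
    for i, word in enumerate(words[:-1]):
        if word in '.?!' and classifier.classify(punct_features(words, i)) == True:
            sents.append(words[start:i+1])
            start = i+1
    # Pick up everything after the last detected boundary,
    # including the sentence-final '.' the loop skipped.
    if start < len(words):
        sents.append(words[start:])
    return sents

With this change, segment_sentences(new_text) should return the three sentences of the Quijote passage instead of raising an IndexError (assuming the trained classifier labels each of those boundaries as True).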