NLTK sentence boundary error

Date: 2016-05-18 12:01:52

Tags: python python-3.x nlp nltk

I am reading Chapter 6 of the book "Natural Language Processing with Python" (http://www.nltk.org/book/ch06.html).

I am trying to replicate the sentence segmentation experiment with the cess_esp corpus. I followed the code line by line and it seems to work, until I try to use it to segment my own text.

>>> import nltk
>>> from nltk.corpus import cess_esp
>>> sentences = cess_esp.sents()
>>> tokens = []
>>> boundaries = set()
>>> offset = 0
>>> for sent in sentences:
        tokens.extend(sent)
        offset += len(sent)
        boundaries.add(offset-1)


>>> def punct_features(tokens, i):
        return {'next-word-capitalized': tokens[i+1][0].isupper(),
                'prev-word': tokens[i-1].lower(),
                'punct': tokens[i],
                'prev-word-is-one-char': len(tokens[i-1]) == 1}

>>> featureset = [(punct_features(tokens, i), (i in boundaries))
              for i in range(1, len(tokens)-1)
              if tokens[i] in '.?!']
>>> size = int(len(featureset) * 0.1)
>>> train_set, test_set = featureset[size:], featureset[:size]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> nltk.classify.accuracy(classifier, test_set)
0.9983388704318937

So far so good. But when I try to use the function to segment my own text, I get an error.

def segment_sentences(words):
    start = 0
    sents = []
    for i, word in enumerate(words):
        if word in '.?!' and classifier.classify(punct_features(words, i)) == True:
            sents.append(words[start:i+1])
            start = i+1
    if start < len(words):
        sents.append(words[start:])
    return sents
  

new_text = ['En', 'un', 'lugar', 'de', 'la', 'Mancha', ',', 'de', 'cuyo',
            'nombre', 'no', 'quiero', 'acordarme', ',', 'no', 'ha', 'mucho',
            'tiempo', 'que', 'vivía', 'un', 'hidalgo', 'de', 'los', 'de',
            'lanza', 'en', 'astillero', ',', 'adarga', 'antigua', ',', 'rocín',
            'flaco', 'y', 'galgo', 'corredor', '.', 'Una', 'olla', 'de', 'algo',
            'más', 'vaca', 'que', 'carnero', ',', 'salpicón', 'las', 'más',
            'noches', ',', 'duelos', 'y', 'quebrantos', 'los', 'sábados', ',',
            'lantejas', 'los', 'viernes', ',', 'algún', 'palomino', 'de',
            'añadidura', 'los', 'domingos', ',', 'consumían', 'las', 'tres',
            'partes', 'de', 'su', 'hacienda', '.', 'El', 'resto', 'della',
            'concluían', 'sayo', 'de', 'velarte', ',', 'calzas', 'de', 'velludo',
            'para', 'las', 'fiestas', ',', 'con', 'sus', 'pantuflos', 'de', 'lo',
            'mesmo', ',', 'y', 'los', 'días', 'de', 'entresemana', 'se',
            'honraba', 'con', 'su', 'vellorí', 'de', 'lo', 'más', 'fino', '.']

segment_sentences(new_text)
Traceback (most recent call last):
  File "<pyshell#31>", line 1, in <module>
    segment_sentences(new_text)
  File "<pyshell#26>", line 5, in segment_sentences
    if word in '.?!' and classifier.classify(punct_features(words, i)) == True:
  File "<pyshell#16>", line 2, in punct_features
    return {'next-word-capitalized': tokens[i+1][0].isupper(),
IndexError: list index out of range

I have been tweaking some of the numbers to see if I could fix the index out of range error, but it doesn't work.

Any help is appreciated.

1 Answer:

Answer 0 (score: 2)

It looks like you need to loop over enumerate(words[:-1]) instead of enumerate(words).

As you have it written, you call punct_features(words, i) on the last word in the list. When the index (i) of that last word is passed to punct_features(), it tries to access tokens[i+1], which is one past the end of words, so you get an IndexError.
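
For reference, this is the same function with just that change applied (a minimal sketch; everything else is left exactly as written in the question, and the final if still collects the tokens after the last detected boundary, so the closing sentence is not lost):

def segment_sentences(words):
    start = 0
    sents = []
    # Stop one token early: punct_features() looks ahead at tokens[i+1],
    # so the final token must never be handed to the classifier.
    for i, word in enumerate(words[:-1]):
        if word in '.?!' and classifier.classify(punct_features(words, i)) == True:
            sents.append(words[start:i+1])
            start = i+1
    # Pick up everything after the last detected boundary,
    # including the sentence-final '.' the loop skipped.
    if start < len(words):
        sents.append(words[start:])
    return sents

With this change, segment_sentences(new_text) should return the three sentences of the Quijote passage instead of raising an IndexError (assuming the trained classifier labels each of those boundaries as True).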