I'm working through Chapter 6 of "Natural Language Processing with Python" (http://www.nltk.org/book/ch06.html).
I'm trying to reproduce the sentence-segmentation experiment using the cess_esp corpus. I followed the code line by line and it seems to work, until I try to use it to segment my own text.
>>> import nltk
>>> from nltk.corpus import cess_esp
>>> sentences = cess_esp.sents()
>>> tokens = []
>>> boundaries = set()
>>> offset = 0
>>> for sent in sentences:
...     tokens.extend(sent)
...     offset += len(sent)
...     boundaries.add(offset-1)
>>> def punct_features(tokens, i):
...     return {'next-word-capitalized': tokens[i+1][0].isupper(),
...             'prev-word': tokens[i-1].lower(),
...             'punct': tokens[i],
...             'prev-word-is-one-char': len(tokens[i-1]) == 1}
>>> featureset = [(punct_features(tokens, i), (i in boundaries))
...               for i in range(1, len(tokens)-1)
...               if tokens[i] in '.?!']
>>> size = int(len(featureset) * 0.1)
>>> train_set, test_set = featureset[size:], featureset[:size]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> nltk.classify.accuracy(classifier, test_set)
0.9983388704318937
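(Not part of the book's listing, but a handy sanity check at this point: NLTK's NaiveBayesClassifier has a show_most_informative_features() method that prints the features the model leans on most when deciding whether a punctuation token ends a sentence.)

>>> classifier.show_most_informative_features(5)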
So far so good. But when I try to use the following function to segment my own text, I get an error.
def segment_sentences(words):
    start = 0
    sents = []
    for i, word in enumerate(words):
        if word in '.?!' and classifier.classify(punct_features(words, i)) == True:
            sents.append(words[start:i+1])
            start = i+1
    if start < len(words):
        sents.append(words[start:])
    return sents
new_text = ['En', 'un', 'lugar', 'de', 'la', 'Mancha', ',', 'de', 'cuyo', 'nombre', 'no',
            'quiero', 'acordarme', ',', 'no', 'ha', 'mucho', 'tiempo', 'que', 'vivía', 'un',
            'hidalgo', 'de', 'los', 'de', 'lanza', 'en', 'astillero', ',', 'adarga', 'antigua',
            ',', 'rocín', 'flaco', 'y', 'galgo', 'corredor', '.', 'Una', 'olla', 'de', 'algo',
            'más', 'vaca', 'que', 'carnero', ',', 'salpicón', 'las', 'más', 'noches', ',',
            'duelos', 'y', 'quebrantos', 'los', 'sábados', ',', 'lantejas', 'los', 'viernes',
            ',', 'algún', 'palomino', 'de', 'añadidura', 'los', 'domingos', ',', 'consumían',
            'las', 'tres', 'partes', 'de', 'su', 'hacienda', '.', 'El', 'resto', 'della',
            'concluían', 'sayo', 'de', 'velarte', ',', 'calzas', 'de', 'velludo', 'para',
            'las', 'fiestas', ',', 'con', 'sus', 'pantuflos', 'de', 'lo', 'mesmo', ',', 'y',
            'los', 'días', 'de', 'entresemana', 'se', 'honraba', 'con', 'su', 'vellorí', 'de',
            'lo', 'más', 'fino', '.']
segment_sentences(new_text)
Traceback (most recent call last):
File "<pyshell#31>", line 1, in <module>
segment_sentences(texto)
File "<pyshell#26>", line 5, in segment_sentences
if word in '.?!' and classifier.classify(punct_features(words, i)) == True:
File "<pyshell#16>", line 2, in punct_features
return {'next-word-capitalized': tokens[i+1][0].isupper(),
IndexError: list index out of range
I've been tweaking the indices to see if I could get rid of the index out of range error, but no luck.
Any help is appreciated.
Answer (score: 2)
It looks like you need to loop over enumerate(words[:-1]) instead of enumerate(words).

As written, you call punct_features(words, i) on the last word in the list. When the index of the last word (i) is passed to punct_features(), it tries to access words[i+1] (as tokens[i+1]). Since there are only len(words) items in words, you get the IndexError.
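For completeness, here is a sketch of segment_sentences with that fix applied (assuming the classifier and punct_features defined in the question). Iterating over words[:-1] means the final token is never tested as a candidate boundary, so punct_features() can always look ahead to words[i+1], and any trailing words are still collected afterwards:

def segment_sentences(words):
    start = 0
    sents = []
    # Stop one short of the end so punct_features() can safely
    # look ahead at words[i+1].
    for i, word in enumerate(words[:-1]):
        if word in '.?!' and classifier.classify(punct_features(words, i)):
            sents.append(words[start:i+1])
            start = i + 1
    if start < len(words):
        # Keep whatever is left over, including the final token.
        sents.append(words[start:])
    return sents

Calling segment_sentences(new_text) should then return one list of tokens per sentence, split wherever the classifier labels a '.', '?' or '!' as sentence-final.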