Question

I am using NLTK's FreqDist object to build a cPickle file. However, for some reason I get an index out of bounds error on the third line ("cutoff ...").

words = [item for sublist in words for item in sublist]
freq = nltk.FreqDist(words)
cutoff = scoreatpercentile(freq.values(),15)
vocab = [word for word,f in freq.items() if f > cutoff] 
cPickle.dump({'distribution':freq,'cutoff':cutoff},open('freqdist_2.pkl',WRITE))

The error reads:

File "C:\Python27\lib\site-packages\scipy\stats\stats.py", line 1419, in scoreatpercentile
score = _interpolate(values[int(idx)], values[int(idx)+1],
IndexError: index out of bounds

This code runs perfectly well on other machines... I'm not sure what I'm missing here.

Answer 1

Before passing the data to scipy's scoreatpercentile function, you need to debug what is in nltk.FreqDist(words).
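
As a rough sketch of that kind of check, the helper below summarizes the counts before they go anywhere near a percentile function. The name summarize_counts is hypothetical, and collections.Counter stands in for nltk.FreqDist on the assumption that both behave as a dict-like table mapping words to counts. One thing worth looking for is an empty table, since a percentile of no values has nothing to index into; whether that is what happens on the failing machine is not something the question confirms.

```python
from collections import Counter


def summarize_counts(freq):
    """Return basic facts about a frequency table's counts (hypothetical helper)."""
    # Materialize the values so the result is a plain list on Python 2 and 3.
    counts = list(freq.values())
    summary = {"n": len(counts), "min": None, "max": None}
    if counts:
        summary["min"] = min(counts)
        summary["max"] = max(counts)
    return summary


# Counter is a stand-in for nltk.FreqDist (assumption: dict-like word -> count).
nested = [["a", "b"], ["a"]]
words = [item for sublist in nested for item in sublist]

print(summarize_counts(Counter(words)))
print(summarize_counts(Counter([])))  # n == 0: no values to take a percentile of
```

If n comes back as 0 on the machine that fails, the problem is in how words is built there rather than in the percentile call.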

If you want a simpler way to get the score, here is an example:

from nltk.probability import FreqDist

words = "this is a foo bar bar bar bar black black sheep sentence".split()
sublist = "foo bar black sheep sentence".split()
words = [i for i in words if i in sublist]

word_freq = FreqDist(words)
cutoff = 15*sum(word_freq.values())/float(100)

vocab = [word for word,f in word_freq.items() if f > cutoff]

print(vocab)

[OUT]:

['bar', 'black']
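
Note that the cutoff in this example is 15% of the total count (15 * 9 / 100 = 1.35), which is not the same thing as the 15th percentile of the counts. If a percentile is what you actually want, here is a minimal standard-library sketch with linear interpolation and an explicit guard for empty input. The function name percentile is hypothetical, the interpolation scheme is assumed from the _interpolate call in the traceback, and a plain dict stands in for the FreqDist.

```python
def percentile(values, per):
    """Percentile with linear interpolation between neighbors (sketch, not scipy's code)."""
    data = sorted(values)
    if not data:
        # Fail with a clear message instead of an IndexError deep in the math.
        raise ValueError("no values to compute a percentile from")
    idx = per / 100.0 * (len(data) - 1)  # fractional position in the sorted data
    lower = int(idx)
    frac = idx - lower
    if frac == 0:
        return float(data[lower])
    return data[lower] + (data[lower + 1] - data[lower]) * frac


# Counts from the example above; the dict is a stand-in for word_freq (assumption).
word_counts = {"foo": 1, "bar": 4, "black": 2, "sheep": 1, "sentence": 1}

cutoff = percentile(word_counts.values(), 15)
vocab = sorted(word for word, f in word_counts.items() if f > cutoff)

print(cutoff)
print(vocab)
```

For these counts both cutoffs happen to keep the same two words, but in general they will select different vocabularies.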
