I have been using the maxent classifier in Python and it is failing, and I don't understand why.
I am using the movie reviews corpus. (Total noob.)
import nltk.classify.util
from nltk.classify import MaxentClassifier
from nltk.corpus import movie_reviews
def word_feats(words):
    return dict([(word, True) for word in words])
negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]
negcutoff = len(negfeats) * 3 // 4
poscutoff = len(posfeats) * 3 // 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
classifier = MaxentClassifier.train(trainfeats)
Here is the error (I know I am doing something wrong; please link to an explanation of how Maxent works):
Warning (from warnings module):
  File "C:\Python27\lib\site-packages\nltk\classify\maxent.py", line 1334
    sum1 = numpy.sum(exp_nf_delta * A, axis=0)
RuntimeWarning: invalid value encountered in multiply
Warning (from warnings module):
  File "C:\Python27\lib\site-packages\nltk\classify\maxent.py", line 1335
    sum2 = numpy.sum(nf_exp_nf_delta * A, axis=0)
RuntimeWarning: invalid value encountered in multiply
Warning (from warnings module):
  File "C:\Python27\lib\site-packages\nltk\classify\maxent.py", line 1341
    deltas -= (ffreq_empirical - sum1) / -sum2
RuntimeWarning: invalid value encountered in divide
Answer 0 (score: 6)
I changed and slightly updated the code.
import nltk, nltk.classify.util, nltk.metrics
from nltk.classify import MaxentClassifier
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
from nltk.probability import FreqDist, ConditionalFreqDist
from nltk.corpus import movie_reviews
def word_feats(words):
    # Bag-of-words features: every token in the review maps to True.
    return dict([(word, True) for word in words])
negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')
negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]
# Use the first three quarters of each class for training.
negcutoff = len(negfeats) * 3 // 4
poscutoff = len(posfeats) * 3 // 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
#classifier = nltk.MaxentClassifier.train(trainfeats)
# ALGORITHMS[0] should be 'GIS'; max_iter caps training at three iterations.
algorithm = nltk.classify.MaxentClassifier.ALGORITHMS[0]
classifier = nltk.MaxentClassifier.train(trainfeats, algorithm, max_iter=3)
classifier.show_most_informative_features(10)
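The code above computes the cutoffs but only builds the training set. A minimal sketch of how the remaining quarter could be held out for testing, using plain lists as stand-ins for negfeats and posfeats (the helper name split_three_quarters and the toy data are assumptions for illustration, not part of the original code):

```python
def split_three_quarters(items):
    """Return (train, test): the first 3/4 of items and the remainder."""
    # Integer division keeps the cutoff usable as a slice index on Python 3.
    cutoff = len(items) * 3 // 4
    return items[:cutoff], items[cutoff:]

# Toy stand-ins for the (features, label) pairs built from movie_reviews.
negfeats = [({"bad": True}, "neg")] * 8
posfeats = [({"good": True}, "pos")] * 8

neg_train, neg_test = split_three_quarters(negfeats)
pos_train, pos_test = split_three_quarters(posfeats)
trainfeats = neg_train + pos_train
testfeats = neg_test + pos_test

print(len(trainfeats), len(testfeats))
```

With the real feature lists, testfeats could then be passed to nltk.classify.util.accuracy(classifier, testfeats) to see how the trained model does on reviews it has not seen.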
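Answer 1 (score: 3)
There is probably a fix for the numpy overflow issue, but since this is just a movie review classifier for learning NLTK / text classification (and you probably don't want training to take a long time anyway), I'll offer a simple workaround: restrict the words used in the feature sets.
You can find the 300 most common words across all reviews like this (you can obviously raise that number if you want):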
all_words = nltk.FreqDist(word for word in movie_reviews.words())
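# On NLTK 3 / Python 3, FreqDist.keys() is neither sliceable nor frequency-ordered; most_common() is.
top_words = set(word for word, count in all_words.most_common(300))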
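Then all you need to do is check each word against top_words in the feature extractor for the reviews. Also, as a suggestion, a dictionary comprehension is more efficient than converting a list of tuples to a dict. So it might look like this: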
def word_feats(words):
    return {word: True for word in words if word in top_words}
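To see the effect of the filter without downloading the corpus, here is a small self-contained sketch. It uses collections.Counter in place of nltk.FreqDist (in NLTK 3, FreqDist is a Counter subclass, so most_common behaves the same way). The toy reviews and the make_word_feats helper are made up for illustration and are not part of NLTK.

```python
from collections import Counter

# Toy stand-in for the tokenized movie reviews (hypothetical data).
reviews = [
    ["a", "great", "movie", "great", "cast"],
    ["a", "dull", "movie"],
    ["great", "fun", "a", "movie"],
]

def make_word_feats(top_words):
    """Build a feature extractor that keeps only words in top_words."""
    def word_feats(words):
        return {word: True for word in words if word in top_words}
    return word_feats

# Count every token across all reviews, then keep the 3 most common.
all_words = Counter(word for review in reviews for word in review)
top_words = set(word for word, count in all_words.most_common(3))

word_feats = make_word_feats(top_words)
print(sorted(top_words))
print(word_feats(reviews[1]))
```

Fewer features per review means fewer weights for the classifier to fit, so training should be quicker. Whether that also makes the warnings go away on the full corpus is something you would have to check.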