I am trying to use the NLTK naive Bayes classifier to classify movie genres. However, I am getting some strange results: right now it guesses based only on the distribution of genres in the input.
If I feed it two action movies and one comedy, it guesses action every time. Of course, I want it to classify based on the input text:
def RemoveStopWords(wordText):
    keep_list = []
    for word in wordText:
        if word not in wordStop:  # wordStop: stop-word collection defined elsewhere
            keep_list.append(word.lower())
    return set(keep_list)

def getFeatures(element):
    splitter = re.compile('\\W+')  # '\\W*' also matches the empty string and over-splits
    f = {}
    plot = [s for s in RemoveStopWords(splitter.split(element['imdb']['plot']))
            if len(s) > 5 and len(s) < 15]
    for w in plot:
        f[w] = w  # feature key and value are both the word itself
    return f
def FindFeaturesForList(MovieList):
    featureSet = []
    for w in MovieList:
        print(w['imdb']['title'])
        try:
            for genre in w['imdb']['genres']:
                featureSet.append((getFeatures(w), genre))
        except KeyError:
            print("Error when retrieving genre, skipping element")
    return featureSet
featureList = FindFeaturesForList(trainset)
cl = nltk.NaiveBayesClassifier.train(featureList)
So whenever I call cl.classify(movie), it returns whichever genre is most frequent in the input. What am I doing wrong?
Answer (score: 0):
In the movie review classification example in the NLTK book, note that the word frequencies across all movie reviews are collected first, and only the most frequent words are selected as feature keys:
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
# In NLTK 2 FreqDist.keys() is sorted by decreasing frequency; in NLTK 3
# use [w for w, _ in all_words.most_common(2000)] instead.
word_features = all_words.keys()[:2000]
I think it is important to note that this is a choice: selecting feature keys this way is not mandatory, and some other clever feature selection may well produce a better classifier. Choosing good features is the art behind the science.
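The frequency cut described above can be sketched without NLTK using `collections.Counter`; the miniature plot strings below are hypothetical stand-ins for the IMDb data:

```python
from collections import Counter

# Hypothetical miniature corpus standing in for the IMDb plot strings.
plots = [
    "a spy chases a villain across the city",
    "a detective chases clues across the city",
    "two friends open a bakery in the city",
]

# Collect word frequencies over ALL documents, then keep only the most
# frequent words as the shared feature keys.
all_words = Counter(word for plot in plots for word in plot.split())
word_features = [word for word, _ in all_words.most_common(5)]
print(word_features)  # ['a', 'the', 'city', 'chases', 'across']
```

In a real pipeline you would of course strip stop words first (as RemoveStopWords does), otherwise fillers like "a" and "the" dominate the cut.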
In any case, try the same idea in your classifier: with a fixed set of feature keys, every document produces the same keys with True/False values, instead of a sparse dict of words the classifier has never seen (which makes it fall back on the class prior, i.e. the most frequent genre):
def getFeatures(text, word_features):
    # Test membership against a token set; 'word in text' would be a
    # substring match (e.g. 'act' in 'action').
    words = set(text.lower().split())
    f = {word: word in words for word in word_features}
    return f
def FindFeaturesForList(MovieList):
    featureSet = []
    splitter = re.compile('\\W+')
    # Collect word frequencies across ALL plots, so that the most common
    # words become the shared feature keys.
    all_words = nltk.FreqDist(
        s.lower()
        for w in MovieList
        for s in RemoveStopWords(splitter.split(w['imdb']['plot']))
        if len(s) > 5 and len(s) < 15)
    word_features = all_words.keys()[:2000]
    for w in MovieList:
        print(w['imdb']['title'])
        try:
            for genre in w['imdb']['genres']:
                featureSet.append(
                    (getFeatures(w['imdb']['plot'], word_features), genre))
        except KeyError:
            print("Error when retrieving genre, skipping element")
    return featureSet
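For reference, here is the kind of boolean feature dict this getFeatures builds, as a standalone sketch (the feature keys and plot text are made up for illustration):

```python
def get_features(text, word_features):
    # One boolean per shared feature key: True when the word occurs in the text.
    # Splitting into a token set avoids substring matches ('act' in 'action').
    words = set(text.lower().split())
    return {word: word in words for word in word_features}

# Hypothetical feature keys and plot text.
word_features = ["detective", "chases", "bakery"]
features = get_features("A detective chases clues across the city", word_features)
print(features)  # {'detective': True, 'chases': True, 'bakery': False}
```

Because every movie now yields the same feature keys, the classifier can actually compare documents instead of treating each plot's vocabulary as unseen features.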