NLTK Naive Bayes classifier gives strange results

Time: 2013-05-10 19:47:17

Tags: python machine-learning nltk

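I am trying to use the NLTK Naive Bayes classifier to classify movies by genre. However, I am getting strange results: at the moment it seems to guess based only on how many examples of each genre it was given.

If I feed it two action movies and one comedy, every prediction is action. Of course, I want the prediction to be based on the input text: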

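import re
import nltk

# wordStop (the stop-word collection) is assumed to be defined elsewhere.
def RemoveStopWords(wordText):
    keep_list = []
    for word in wordText:
        if word not in wordStop:
            keep_list.append(word.lower())

    return set(keep_list)

def getFeatures(element):
    # Split the plot on runs of non-word characters.
    splitter = re.compile(r'\W+')
    f = {}
    # Keep only words of 6 to 14 characters.
    plot = [s for s in RemoveStopWords(splitter.split(element['imdb']['plot']))
            if len(s) > 5 and len(s) < 15]

    # Each word becomes a feature whose value is the word itself.
    for w in plot:
        f[w] = w

    return f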

def FindFeaturesForList(MovieList):
    featureSet = []
    for w in MovieList:
        print(w['imdb']['title'])
        try:
            # One training example per (movie, genre) pair.
            for genre in w['imdb']['genres']:
                featureSet.append((getFeatures(w), genre))
        except Exception:
            print("Error when retrieving genre, skipping element")

    return featureSet

# trainset (the list of movie records) is assumed to be loaded elsewhere.
featureList = FindFeaturesForList(trainset)
cl = nltk.NaiveBayesClassifier.train(featureList)

So whenever I call cl.classify(movie), it returns the most frequent genre in the training input. What am I doing wrong?

1 Answer:

Answer 0: (score: 0)

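In the movie review classification example in the NLTK book, note that the frequencies of all words across all reviews are collected first, and then only the most frequent words are selected as feature keys.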

all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = all_words.keys()[:2000]
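The slice above relies on `FreqDist.keys()` returning words in decreasing order of frequency, which, as far as I know, was true of the NLTK release current when this was written. In NLTK 3, `FreqDist` is (to my understanding) a `collections.Counter` subclass, so asking for the most frequent words explicitly is safer. A minimal sketch, using `Counter` as a stand-in for `FreqDist` and a made-up `words` list in place of `movie_reviews.words()`:

```python
from collections import Counter

# Hypothetical token stream standing in for movie_reviews.words().
words = ["Alien", "alien", "ship", "ship", "ship", "crew", "planet"]

# Counter stands in for nltk.FreqDist (assumption: in NLTK 3, FreqDist
# subclasses Counter and therefore offers the same most_common method).
all_words = Counter(w.lower() for w in words)

# Ask for the N most frequent words explicitly instead of slicing keys().
N = 2
word_features = [word for word, count in all_words.most_common(N)]
print(word_features)
```

With the real corpus, N would be 2000 as in the book example.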

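I think it is important to note that this is a choice. Selecting feature keys this way is not mandatory, and some other clever feature selection might produce a better classifier. Choosing good features is the art behind the science.

Anyway, maybe try the same idea in your classifier: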

def getFeatures(text, word_features):
    text = text.lower()
    f = {word: word in text for word in word_features}
    return f


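def FindFeaturesForList(MovieList):
    featureSet = []
    # Split plots on runs of non-word characters.
    splitter = re.compile(r'\W+')
    # Count words of 6 to 14 characters across all plots.
    all_words = nltk.FreqDist(
        s.lower()
        for w in MovieList
        for s in RemoveStopWords(splitter.split(w['imdb']['plot']))
        if len(s) > 5 and len(s) < 15)
    # most_common assumes NLTK 3; on older releases use all_words.keys()[:2000].
    word_features = [word for word, count in all_words.most_common(2000)]
    for w in MovieList:
        print(w['imdb']['title'])
        try:
            # Every movie is described by the same set of feature keys.
            for genre in w['imdb']['genres']:
                featureSet.append(
                    (getFeatures(w['imdb']['plot'], word_features), genre))
        except Exception:
            print("Error when retrieving genre, skipping element")

    return featureSet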
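One thing to watch in getFeatures above: `word in text` is tested against the whole plot string, so it matches substrings ("action" is found inside "reaction"). If whole-word matching is what you want, testing against a set of tokens is one option. A minimal, standard-library-only sketch; the `tokenize` and `get_features` helpers and the two sample movies are made up for illustration:

```python
import re
from collections import Counter

TOKEN = re.compile(r"\w+")

def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return TOKEN.findall(text.lower())

def get_features(text, word_features):
    """Map every vocabulary word to True/False by whole-word presence."""
    tokens = set(tokenize(text))
    return {word: (word in tokens) for word in word_features}

# Made-up records shaped like the ones in the question.
movies = [
    {"imdb": {"title": "A", "genres": ["Action"],
              "plot": "A soldier fights through an explosion."}},
    {"imdb": {"title": "B", "genres": ["Comedy"],
              "plot": "Awkward wedding jokes and a reaction shot."}},
]

# Shared vocabulary: the most frequent tokens across all plots.
counts = Counter(tok for m in movies for tok in set(tokenize(m["imdb"]["plot"])))
word_features = [word for word, count in counts.most_common(2000)]

# One (features, genre) pair per movie and genre.
feature_set = [(get_features(m["imdb"]["plot"], word_features), genre)
               for m in movies for genre in m["imdb"]["genres"]]

comedy_plot = movies[1]["imdb"]["plot"]
substring_hit = "action" in comedy_plot.lower()                  # matches inside "reaction"
token_hit = get_features(comedy_plot, ["action"])["action"]      # whole words only
print(substring_hit, token_hit)
```

The resulting `feature_set` has the same list-of-(dict, label) shape that is passed to nltk.NaiveBayesClassifier.train in the question.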