Python - NLTK train/test split

Date: 2018-07-13 13:48:36

Tags: python machine-learning scikit-learn nlp nltk

I have been following SentDex's video series on NLTK and Python and have built a script that uses various models, such as logistic regression, to determine review sentiment. My concern is that SentDex's approach appears to include the test set when deciding which words to use for training features, which is obviously undesirable (the train/test split only happens after feature selection).

(Edited following Mohammed Kashif's comment.)

Full code:

import nltk
import numpy as np
from nltk.classify.scikitlearn import SklearnClassifier
from nltk.classify import ClassifierI
from nltk.corpus import movie_reviews
from sklearn.naive_bayes import MultinomialNB

documents = [ (list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category) ]

all_words = []

# Builds the vocabulary from every review in the corpus -- including the
# reviews that later end up in the test split, which is the leak in question.
for w in movie_reviews.words():
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)

word_features = list(all_words.keys())[:3000]

def find_features(documents):
    words = set(documents)
    features = {}
    for w in word_features:
        features[w] = (w in words)

    return features

featuresets = [(find_features(rev), category) for (rev, category) in documents]

np.random.shuffle(featuresets)

training_set = featuresets[:1800]
testing_set = featuresets[1800:]

MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print("MNB_classifier accuracy:", (nltk.classify.accuracy(MNB_classifier, testing_set)) *100)

What I have tried:

documents = [ (list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category) ]

np.random.shuffle(documents)

training_set = documents[:1800]
testing_set = documents[1800:]

all_words = []

for w in documents.words():
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)

word_features = list(all_words.keys())[:3000]

def find_features(training_set):
    words = set(training_set)
    features = {}
    for w in word_features:
        features[w] = (w in words)

    return features

featuresets = [(find_features(rev), category) for (rev, category) in training_set]

np.random.shuffle(featuresets)

training_set = featuresets
testing_set = testing_set

MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print("MNB_classifier accuracy:", (nltk.classify.accuracy(MNB_classifier, testing_set)) *100)

Which produces the error:

Traceback (most recent call last):
  File "", line 34, in <module>
    print("MNB_classifier accuracy:", (nltk.classify.accuracy(MNB_classifier, testing_set)) *100)
  File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\classify\util.py", line 87, in accuracy
    results = classifier.classify_many([fs for (fs, l) in gold])
  File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\classify\scikitlearn.py", line 85, in classify_many
    X = self._vectorizer.transform(featuresets)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\feature_extraction\dict_vectorizer.py", line 291, in transform
    return self._transform(X, fitting=False)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\feature_extraction\dict_vectorizer.py", line 166, in _transform
    for f, v in six.iteritems(x):
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\externals\six.py", line 439, in iteritems
    return iter(getattr(d, _iteritems)(**kw))
AttributeError: 'list' object has no attribute 'items'
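
If I am reading the traceback correctly, this happens because find_features was only applied to training_set: the entries of testing_set are still (word_list, category) tuples, so scikit-learn's DictVectorizer ends up calling .items() on a plain list. A quick check:

fs, label = testing_set[0]
print(type(fs))   # <class 'list'> -- SklearnClassifier expects a feature dict here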

1 Answer:

Answer 0 (score: 1):

OK, so there are a couple of errors in the code. We will go through them one by one.

First, your documents list is a list of tuples, and it has no words() method. To access all the words, change the for loop like this:

all_words = []

for words_list, categ in documents:      #<-- each wordlist is a list of words
    for w in words_list:                 #<-- Then access each word in list
        all_words.append(w.lower())
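
For what it's worth, the same loop can also be written as a single comprehension (purely a stylistic alternative):

all_words = [w.lower() for words_list, categ in documents for w in words_list]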

Second, you need to create feature sets for both the training and the test set. You only created feature sets from training_set. Change the code to this:

featuresets = [(find_features(rev), category) for (rev, category) in documents]

np.random.shuffle(featuresets)

training_set = featuresets[:1800]
testing_set = featuresets[1800:]

So the final code becomes:

documents = [ (list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category) ]

np.random.shuffle(documents)

training_set = documents[:1800]
testing_set = documents[1800:]

all_words = []

for words_list, categ in documents:
    for w in words_list:
        all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)

word_features = list(all_words.keys())[:3000]

def find_features(training_set):
    words = set(training_set)
    features = {}
    for w in word_features:
        features[w] = (w in words)

    return features

featuresets = [(find_features(rev), category) for (rev, category) in documents]

np.random.shuffle(featuresets)

training_set = featuresets[:1800]
testing_set = featuresets[1800:]

MNB_classifier = SklearnClassifier(MultinomialNB())
MNB_classifier.train(training_set)
print("MNB_classifier accuracy:", (nltk.classify.accuracy(MNB_classifier, testing_set)) *100)