Naive Bayes MultinomialNB scikit-learn / sklearn

Date: 2018-05-24 18:17:49

Tags: python python-3.x pandas machine-learning scikit-learn

I am building a naive Bayes classifier, following the tutorial on the scikit-learn website.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
import csv
import string
# train_test_split lives in sklearn.model_selection
# (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Importing dataset
# on_bad_lines='skip' needs pandas 1.3+; older versions used error_bad_lines=False
data = pd.read_csv("test.csv", quotechar='"', delimiter=',', quoting=csv.QUOTE_ALL, skipinitialspace=True, on_bad_lines='skip')
df2 = data.set_index("name", drop=False)

# Label each review: +1 (positive) if rating > 3, otherwise -1 (negative)
df2['sentiment'] = df2['rating'].apply(lambda rating: +1 if rating > 3 else -1)

# Hold out 20% of the rows for testing
train, test = train_test_split(df2, test_size=0.2)

# Build the document-term matrix from the training reviews only
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(train['review'])
test_matrix = count_vect.transform(test['review'])

clf = MultinomialNB().fit(X_train_counts, train['sentiment'])

The first parameter is the vocabulary dictionary, and it returns a Document-Term matrix. What should the second parameter be, twenty_train.target?

Edit: sample data

Name, review,rating
film1,......,1
film2, the film is....,5 
film3, film about..., 4

Following this explanation, I created a new column: if the rating is > 3 the review is positive, otherwise it is negative.

df2['sentiment'] = df2['rating'].apply(lambda rating : +1 if rating > 3 else -1)

1 answer:

Answer 0 (score: 3)

Your question is not 100% clear, but let me explain.

The fit method of MultinomialNB expects x and y as input. Here, x should be the training vectors (the training data) and y should be the target values.

clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)

In more detail:

X : {array-like, sparse matrix}, shape = [n_samples, n_features]
    Training vectors, where n_samples is the number of samples and n_features is the number of features.

y : array-like, shape = [n_samples]
    Target values.

Note: make sure that x (shape = [n_samples, n_features]) and y (shape = [n_samples]) are defined with the correct shapes. Otherwise, fit will throw an error.
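To make the note about shapes concrete, here is a minimal sketch, assuming scikit-learn and NumPy are installed. The tiny count matrix and the labels are made up for illustration. The only point is that fit accepts X and y when they agree on n_samples and raises a ValueError when they do not.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: 4 documents (rows) x 3 vocabulary terms (columns).
X = np.array([[2, 0, 1],
              [0, 3, 0],
              [1, 0, 2],
              [0, 2, 1]])
y_ok = np.array([1, -1, 1, -1])   # one label per row of X
y_bad = np.array([1, -1, 1])      # one label too few

# Matching shapes: X is (n_samples, n_features), y is (n_samples,).
clf = MultinomialNB().fit(X, y_ok)

# Mismatched shapes: scikit-learn rejects inconsistent sample counts.
try:
    MultinomialNB().fit(X, y_bad)
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

print(X.shape, y_ok.shape, mismatch_raised)
```

Checking X.shape[0] against y.shape[0] before calling fit is usually the quickest way to find this kind of error.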

Toy example:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

newsgroups_train = fetch_20newsgroups(subset='train')
categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']

newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
vectorizer = TfidfVectorizer()
# the following will be the training data
vectors = vectorizer.fit_transform(newsgroups_train.data)
vectors.shape

newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)
# this is the test data
vectors_test = vectorizer.transform(newsgroups_test.data)

clf = MultinomialNB(alpha=.01)

# the fitting is done using the TRAINING data
# Check the shapes before fitting
vectors.shape
#(2034, 34118)
newsgroups_train.target.shape
#(2034,)

# fit the model using the TRAINING data
clf.fit(vectors, newsgroups_train.target)

# the PREDICTION is done using the TEST data
pred = clf.predict(vectors_test)

Edit:

newsgroups_train.target is just a numpy array that contains the labels (the targets, or classes).
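One way to see which labels a target array holds is np.unique. The sketch below uses a small hand-made array as a stand-in for newsgroups_train.target, which is an assumption made so that it runs without downloading the dataset. With the four categories selected above, the real array would be expected to hold the integer labels 0 to 3 in the same way.

```python
import numpy as np

# Hypothetical stand-in for newsgroups_train.target: integer class labels,
# one per training document (the real array has 2034 entries).
target = np.array([0, 3, 1, 2, 2, 0, 1, 3, 3, 0])

classes = np.unique(target)   # sorted distinct labels
print(classes)                # [0 1 2 3]
print(len(classes))           # 4
```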

So in this example we have 4 different classes/targets.

This variable is needed in order to fit the classifier.
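Applied to the data in the question, the same idea can be sketched as follows. The reviews and ratings here are invented and stand in for test.csv, and random_state is fixed only to make the split repeatable. The sentiment column plays the role that twenty_train.target plays in the tutorial: it is the y passed to fit, with one label per row of the document-term matrix.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical reviews standing in for test.csv (columns: name, review, rating).
df = pd.DataFrame({
    "name": ["film%d" % i for i in range(1, 11)],
    "review": [
        "terrible plot and bad acting",
        "the film is wonderful and moving",
        "a great film about friendship",
        "boring, bad and far too long",
        "wonderful cast, great story",
        "bad script, terrible ending",
        "great fun, a wonderful film",
        "boring and terrible",
        "moving story, great acting",
        "too long and boring",
    ],
    "rating": [1, 5, 4, 2, 5, 1, 4, 2, 5, 3],
})

# Same labelling rule as in the question: +1 if rating > 3, otherwise -1.
df["sentiment"] = df["rating"].apply(lambda rating: 1 if rating > 3 else -1)

train, test = train_test_split(df, test_size=0.2, random_state=0)

count_vect = CountVectorizer()
X_train = count_vect.fit_transform(train["review"])  # learn the vocabulary on train only
X_test = count_vect.transform(test["review"])        # reuse that vocabulary for test

y_train = train["sentiment"]  # the target values, one per training review

# One row of X_train per training review, one label per training review.
assert X_train.shape[0] == y_train.shape[0]

clf = MultinomialNB().fit(X_train, y_train)
pred = clf.predict(X_test)
print(pred)
```

The labels can be any array-like of length n_samples, so a pandas Series such as train["sentiment"] works without conversion.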