ValueError: Can't handle mix of unknown and binary

Time: 2014-06-09 21:12:46

Tags: python scikit-learn sentiment-analysis multinomial

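I recently started using scikit-learn for sentiment analysis. After training on my labeled data, I tried to test the classifier on an unlabeled data set and got this error: 'ValueError: Can't handle mix of continuous-multioutput and binary'.

I think my mistake is in what I am passing as (y_pred).

The error comes from: accuracy = classifier.score(test_matrix,ALL_test)

But when I change ALL_test to ALL_train (the labeled data used for training), it reports an accuracy of 0.971251409245, which is definitely wrong.

What should I do?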

# -*- coding:utf-8 -*-
import sklearn.cross_validation
import sklearn.feature_extraction.text
import sklearn.metrics
import sklearn.naive_bayes
from sklearn import svm
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score


name = ['Tweet','Label']
name2 =['Tweet','Label']
data_train = pd.read_table('unstemmedtrain.csv',sep = ';',names = name)
data_test = pd.read_table('unstemmedtest.csv',names=name2)
train_data =pd.DataFrame(data_test,columns=name2)
test_data=pd.DataFrame(data_train,columns=name)

vectorizer =  sklearn.feature_extraction.text.CountVectorizer()

train_matrix = vectorizer.fit_transform(train_data['Tweet'])
test_matrix = vectorizer.transform(test_data['Tweet'])
#print train_matrix

positive_train = (train_data['Label']=='positive')
negative_train= (train_data['Label']=='negative')
neutral_train=(train_data['Label']=='neutral')
#print negative_cases_train
ALL_train = positive_train +negative_train +neutral_train
#print positive_cases_train
ALL_test = (test_data['Tweet'])
positive_test =(test_data['Label']=='positive')
negative_test =(test_data['Label']=='negative')
neutral_test = (test_data['Label']=='neutral')
ALL_Test = positive_test + negative_test + neutral_test

#print positive_cases_test


classifier=sklearn.naive_bayes.MultinomialNB()
classifier2 = classifier.fit(train_matrix,ALL_train)

p_sentiment = classifier.predict(test_matrix)
p_prob = classifier.predict_proba(test_matrix)
#print predicted_prob
accuracy = classifier.score(test_matrix,ALL_test)
print accuracy

2 answers:

Answer 0: (score: 1)

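I see a few problems here.

  1. Are you trying to predict which tweets are positive, which are negative, and which are neutral, or are you trying to predict whether a tweet is any one of positive/negative/neutral? You are doing the latter. Suppose train_data['Label'] = ['positive', 'positive', 'negative', 'neutral']. Then your code does the following: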

    positive_train = (train_data['Label']=='positive') # = [True, True, False, False]
    negative_train= (train_data['Label']=='negative') # = [False, False, True, False]
    neutral_train=(train_data['Label']=='neutral') # = [False, False, False, True]
    ALL_train = positive_train +negative_train +neutral_train # = [True, True, True, True]
    
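  2. You are giving the score function ALL_test = (test_data['Tweet']), which is text, instead of ALL_Test = positive_test + negative_test + neutral_test, which is your ground truth. That is where the exception comes from. I don't know why you need ALL_test, but if you do, give it a different name, because the two similar names are confusing you.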
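
The two points above can be illustrated with a minimal sketch. It assumes made-up toy tweets in place of the CSV files from the question, and the variable names (train_tweets, train_labels, and so on) are hypothetical. It shows that adding the boolean masks collapses every label to True, and that passing the label strings themselves as the target lets the classifier learn three classes and be scored against real ground truth.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for the CSV files in the question.
train_tweets = ["good great happy", "great love happy", "bad awful sad",
                "awful hate sad", "okay fine average", "fine average normal"]
train_labels = np.array(["positive", "positive", "negative",
                         "negative", "neutral", "neutral"])
test_tweets = ["happy love", "sad hate", "average okay"]
test_labels = np.array(["positive", "negative", "neutral"])

# Adding boolean masks behaves like an element-wise OR, so every
# entry becomes True and the class information is lost.
collapsed = ((train_labels == "positive")
             + (train_labels == "negative")
             + (train_labels == "neutral"))
all_true = bool(collapsed.all())

# Fit the vocabulary on the training text only, then reuse it for the test text.
vectorizer = CountVectorizer()
train_matrix = vectorizer.fit_transform(train_tweets)
test_matrix = vectorizer.transform(test_tweets)

# Use the label strings directly as the target: one class per sentiment.
classifier = MultinomialNB()
classifier.fit(train_matrix, train_labels)

predicted = classifier.predict(test_matrix)
# score() needs the true labels of the test set, not the tweet text.
accuracy = classifier.score(test_matrix, test_labels)
print(predicted, accuracy)
```

Note that score() can only be computed when the test set has labels; for truly unlabeled data, only predict() and predict_proba() apply.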

Answer 1: (score: 0)

You have to pass ALL_train to classifier.score

For example:

accuracy = classifier.score(test_matrix,ALL_train)
print accuracy

If you want to evaluate the model on the test data, then recall, precision, F1 score and auc_score may help.
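
As a rough sketch of what computing those metrics might look like, the example below uses made-up true and predicted labels (y_true and y_pred are hypothetical, not taken from the question's data). Macro averaging is one possible choice for a three-class problem. AUC is left out because for more than two classes it needs predicted probabilities rather than hard labels.

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

# Hypothetical ground truth and predictions for six test tweets.
y_true = ["positive", "negative", "neutral",
          "positive", "negative", "neutral"]
y_pred = ["positive", "negative", "neutral",
          "negative", "negative", "positive"]

accuracy = accuracy_score(y_true, y_pred)
# average="macro" computes the metric per class and takes the unweighted mean.
precision = precision_score(y_true, y_pred, average="macro")
recall = recall_score(y_true, y_pred, average="macro")
f1 = f1_score(y_true, y_pred, average="macro")

print(accuracy, precision, recall, f1)
```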