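I am trying to score one file against another. Both files contain the same data but different labels. The labels in the train data are guaranteed to be correct, while the labels in the test data are not necessarily correct. I would like to know the accuracy, recall and F-score.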
import pandas
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn import cross_validation
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import precision_score, recall_score, confusion_matrix, classification_report, accuracy_score, f1_score
df_train = pd.read_csv('train.csv', sep = ',')
df_test = pd.read_csv('teste.csv', sep = ',')
vec_train = TfidfVectorizer()
X_train = vec_train.fit_transform(df_train['text'])
y_train = df_train['label']
vec_test = TfidfVectorizer()
X_test = vec_test.fit_transform(df_train['text'])
y_test = df_test['label']
clf = LogisticRegression(penalty='l2', multi_class = 'multinomial',solver ='newton-cg')
y_pred = clf.predict(X_test)
print ("Accuracy on training set:")
print (clf.score(X_train, y_train))
print ("Accuracy on testing set:")
print (clf.score(X_test, y_test))
print ("Classification Report:")
print (metrics.classification_report(y_test, y_pred))
A silly example of the data:
TRAIN
text,label
dogs are cool,animal
flowers are beautifil,plants
pen is mine,objet
beyonce is an artist,person
TEST
text,label
dogs are cool,objet
flowers are beautifil,plants
pen is mine,person
beyonce is an artist,animal
Error:
Traceback (most recent call last):
  File "accuracy.py", line 30, in <module>
    y_pred = clf.predict(X_test)
  File "/usr/lib/python3/dist-packages/sklearn/linear_model/base.py", line 324, in predict
    scores = self.decision_function(X)
  File "/usr/lib/python3/dist-packages/sklearn/linear_model/base.py", line 298, in decision_function
    "yet" % {'name': type(self).__name__})
sklearn.exceptions.NotFittedError: This LogisticRegression instance is not fitted yet
I just want to compute the accuracy on the test data.
Answer 0 (score: 1)
You have to train the classifier object on X_train first, before you can use the predict function on X_test. Like this:
clf.fit(X_train, y_train)
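The fit-before-predict ordering can be illustrated with a small sketch. It is only an illustration under assumptions: the texts and labels are copied from the question's TRAIN example and held in plain lists instead of being read from `train.csv`, and `LogisticRegression()` is used with its default settings rather than the arguments from the question.

```python
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data copied from the question's TRAIN example (assumption: in-memory
# lists stand in for the 'text' and 'label' columns of train.csv).
texts = ["dogs are cool", "flowers are beautifil", "pen is mine", "beyonce is an artist"]
labels = ["animal", "plants", "objet", "person"]

X = TfidfVectorizer().fit_transform(texts)
clf = LogisticRegression()  # default settings, not the question's arguments

try:
    clf.predict(X)  # calling predict() before fit() raises NotFittedError
    raised = False
except NotFittedError:
    raised = True

clf.fit(X, labels)       # train the classifier first...
y_pred = clf.predict(X)  # ...after which predict() runs without the error
```

Here `raised` ends up `True`, which is the same failure shown in the question's traceback; once `fit` has been called, `predict` returns one label per input row.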
Answer 1 (score: 1)
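You are fitting a new TfidfVectorizer on the test data. That will give wrong results, because the classifier expects the features learned from the train data. You should reuse the same object that was fitted on the train data.
Do the following: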
vec_train = TfidfVectorizer()
X_train = vec_train.fit_transform(df_train['text'])
X_test = vec_train.transform(df_test['text'])
After this, as @MohammedKashif said, you need to train the LogisticRegression model first and then predict on the test data.
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
After that, you can use your scoring code without any errors.
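Putting both answers together, a minimal end-to-end sketch might look like the following. The assumptions are stated in the comments: the question's toy TRAIN and TEST rows are held in plain lists instead of being loaded from `train.csv` and `teste.csv`, `LogisticRegression()` uses default settings, and macro averaging is just one possible choice for the multi-class precision, recall and F-score.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Assumption: in-memory lists replace the CSV files from the question.
texts = ["dogs are cool", "flowers are beautifil", "pen is mine", "beyonce is an artist"]
y_train = ["animal", "plants", "objet", "person"]  # labels from TRAIN
y_test = ["objet", "plants", "person", "animal"]   # labels from TEST

# Fit the vectorizer on the train texts only, then reuse it for the test texts.
vec = TfidfVectorizer()
X_train = vec.fit_transform(texts)
X_test = vec.transform(texts)  # same texts in both files, per the question

# Train first, then predict.
clf = LogisticRegression()  # default settings, not the question's arguments
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Compare the predictions with the (possibly wrong) test labels.
accuracy = accuracy_score(y_test, y_pred)
precision, recall, fscore, _ = precision_recall_fscore_support(
    y_test, y_pred, average="macro", zero_division=0
)

print("Accuracy:", accuracy)
print("Precision:", precision, "Recall:", recall, "F-score:", fscore)
```

Because `X_test` comes from `vec.transform`, it has the same number of columns as `X_train`, so the classifier can score it directly.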