I am running two different classification algorithms, logistic regression and naive Bayes, on my data, but they give me the same accuracy even when I change the train/test split ratio. Here is the code I am using:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
df = pd.read_csv('Speed Dating.csv', encoding = 'latin-1')
X = pd.DataFrame()
X['d_age'] = df['d_age']
X['match'] = df['match']
X['importance_same_religion'] = df['importance_same_religion']
X['importance_same_race'] = df['importance_same_race']
X['diff_partner_rating'] = df['diff_partner_rating']
# Drop NAs
X = X.dropna(axis=0)
# Categorical variable Match [Yes, No]
y = X['match']
# Drop y from X
X = X.drop(['match'], axis=1)
# Transformation
scaler = StandardScaler()
X = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Logistic Regression
model = LogisticRegression(penalty='l2', C=1)
model.fit(X_train, y_train)
print('Accuracy Score with Logistic Regression: ', accuracy_score(y_test, model.predict(X_test)))
#Naive Bayes
model_2 = GaussianNB()
model_2.fit(X_train, y_train)
print('Accuracy Score with Naive Bayes: ', accuracy_score(y_test, model_2.predict(X_test)))
print(model_2.predict(X_test))
Is it possible for the accuracy to be the same every time?
Answer (score: 1)
This is a common phenomenon when the class frequencies are imbalanced, i.e. nearly all samples belong to one class. For example, if 80% of your samples belong to class "No", a classifier will often tend to predict "No", because such a trivial prediction already achieves the highest overall accuracy on your training set.
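This majority-class baseline is easy to compute yourself. A minimal sketch with made-up labels (the 80/20 split below is hypothetical, not taken from the Speed Dating data):

```python
from collections import Counter

# Hypothetical imbalanced labels: 80% "No", 20% "Yes"
y = ['No'] * 80 + ['Yes'] * 20

# A classifier that always predicts the majority class already
# reaches an accuracy equal to the majority-class frequency
counts = Counter(y)
baseline_accuracy = max(counts.values()) / len(y)
print(baseline_accuracy)  # 0.8
```

If both of your models report an accuracy close to this baseline, they may simply be predicting the majority class almost everywhere.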
In general, when evaluating the performance of a binary classifier, you should not look at overall accuracy alone. You have to consider other metrics such as the ROC curve, per-class precision, F1 score, and so on.

In your case, you can use sklearn's classification report to get a better idea of what your classifiers have actually learned:
from sklearn.metrics import classification_report
print(classification_report(y_test, model.predict(X_test)))
print(classification_report(y_test, model_2.predict(X_test)))
This will print the precision, recall, and F1 score for each class.
There are several options for achieving better classification accuracy on the class "Yes".
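One common option (among others, such as resampling or threshold tuning) is to reweight the classes during training. A minimal sketch with synthetic data, using scikit-learn's `class_weight='balanced'`, which weights samples inversely to their class frequency:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced data: 80 "No" vs. 20 "Yes" samples
rng = np.random.RandomState(42)
X = rng.randn(100, 2)
y = np.array(['No'] * 80 + ['Yes'] * 20)

# class_weight='balanced' penalizes mistakes on the minority class
# more heavily, discouraging the trivial always-"No" solution
clf = LogisticRegression(class_weight='balanced')
clf.fit(X, y)
print(clf.classes_)
```

This usually trades a little overall accuracy for much better recall on the minority class, which is often the right trade-off for imbalanced problems.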