I have written code for logistic regression in Python (Anaconda 3.5.2 with sklearn 0.18.2). I have implemented GridSearchCV() and train_test_split() to sort through the parameters and to split the input data.

My goal is to find the overall (mean) accuracy over the 10 folds, with its standard error, on the test data. In addition, I try to predict the class labels correctly, create a confusion matrix and prepare a classification report summary.

Please advise me on the following:

(1) Is my code correct? Please check each part.

(2) I have tried two different sklearn functions, clf.score() and clf.cv_results_. I see that they give different results. Which one is correct? (However, the summary is not included.)
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# Load any n x m data and a label column. No missing or NaN values.
# I am skipping the data-loading part; one can load any data set to test the code below.
sc = StandardScaler()
lr = LogisticRegression()
pipe = Pipeline(steps=[('sc', sc), ('lr', lr)])
parameters = {'lr__C': [0.001, 0.01]}
if __name__ == '__main__':
    clf = GridSearchCV(pipe, parameters, n_jobs=-1, cv=10, refit=True)
    X_train, X_test, y_train, y_test = train_test_split(Data, labels, random_state=0)
    # Train the classifier on the training features and targets
    clf.fit(X_train, y_train)
    print("Accuracy on training set: {:.2f}% \n".format((clf.score(X_train, y_train))*100))
    print("Accuracy on test set: {:.2f}%\n".format((clf.score(X_test, y_test))*100))
    print("Best Parameters: ")
    print(clf.best_params_)
    # Alternatively, using cv_results_ (one mean per parameter setting)
    print("Accuracy on training set (CV means): ", clf.cv_results_['mean_train_score'] * 100)
    print("Accuracy on test set (CV means): ", clf.cv_results_['mean_test_score'] * 100)
    # Predict class labels
    y_pred = clf.best_estimator_.predict(X_test)
    # Confusion Matrix
    class_names = ['Positive', 'Negative']
    confMatrix = confusion_matrix(y_test, y_pred)
    print(confMatrix)
    # Accuracy Report
    classificationReport = classification_report(y_test, y_pred, target_names=class_names)
    print(classificationReport)
I would appreciate any suggestions.
Answer 0 (score: 1)
First of all, the desired metric, i.e. accuracy, is already the default scorer of LogisticRegression(). Therefore, you may omit the scoring='accuracy' parameter of GridSearchCV().
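As a minimal sketch (using a bare LogisticRegression rather than the question's pipeline), the following two searches are configured identically, because the estimator's default .score() is accuracy:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.001, 0.01]}
# scoring omitted: falls back to the estimator's default scorer, which is accuracy
clf_default = GridSearchCV(LogisticRegression(), param_grid, cv=10, refit=True)
# scoring spelled out explicitly: same behaviour
clf_explicit = GridSearchCV(LogisticRegression(), param_grid, cv=10, refit=True,
                            scoring='accuracy')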
Secondly, the method score(X, y) returns the value of the chosen metric for the refit best estimator, IF the classifier has been refit after sorting through all the possible options in param_grid, i.e. because you provided refit=True. Note that clf.score(X, y) == clf.best_estimator_.score(X, y). Thus, it does not print out an averaged metric but rather the score of the single best model.
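A small check of that equality, as a sketch assuming clf has already been fitted with refit=True as in the question's code:

# clf.score delegates to the single refit best_estimator_, so both calls
# return the same number: the test-set accuracy of the best model,
# not a cross-validation average.
assert clf.score(X_test, y_test) == clf.best_estimator_.score(X_test, y_test)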
Thirdly, the attribute cv_results_ is a much broader summary, since it contains the results of every fit. However, the values it reports are means obtained by averaging the per-fold results. These are the values that you wish to store.
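For instance, here is a sketch (again assuming the fitted clf from the question) of reading off the cross-validated mean accuracy and its spread for the best parameter setting, which is closer to the "mean over 10 folds with an error bar" that the question asks for:

cv = clf.cv_results_
best = clf.best_index_                  # row of cv_results_ that matches best_params_
mean_acc = cv['mean_test_score'][best]  # mean validation accuracy across the 10 folds
std_acc = cv['std_test_score'][best]    # standard deviation across the folds
print("CV accuracy: {:.3f} +/- {:.3f}".format(mean_acc, std_acc))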
Let me present a toy example here for better understanding:
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

param_grid = {'C': [0.001, 0.01]}
clf = GridSearchCV(cv=10, estimator=LogisticRegression(), refit=True,
                   param_grid=param_grid)
clf.fit(X_train, y_train)

print(clf.best_estimator_.score(X_train, y_train))
print('____')
print(clf.cv_results_)

This code yields the following:

0.98107957707289928  # this is the best accuracy score
{'mean_fit_time': array([ 0.15465896,  0.23701136]),
 'mean_score_time': array([ 0.0006465 ,  0.00065773]),
 'mean_test_score': array([ 0.934335  ,  0.9376739 ]),
 'mean_train_score': array([ 0.96475625,  0.98225632]),
 'param_C': masked_array(data = [0.001 0.01],
 'params': ({'C': 0.001}, {'C': 0.01})
There are two mean values in each array because I provided two options for C in param_grid.
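To see that correspondence explicitly, a quick sketch over the fitted clf from the toy example:

# Each row of cv_results_ lines up with one candidate from param_grid, in order.
for params, mean in zip(clf.cv_results_['params'], clf.cv_results_['mean_test_score']):
    print(params, '->', mean)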
I hope that helps!