Difference between the Python functions clf.score and clf.cv_results_

Asked: 2017-07-22 03:47:03

标签: python scikit-learn classification logistic-regression cross-validation

I wrote logistic regression code in Python (Anaconda 3.5.2 with sklearn 0.18.2). I have used GridSearchCV() and train_test_split() to sort through the parameters and to split the input data.

My goal is to find the overall (average) accuracy over the 10 folds, with its standard error, on the test data. In addition, I try to predict the class labels, create a confusion matrix, and prepare a classification report summary.

Please advise me on the following:

(1) Is my code correct? Please check each part.

(2) I have tried two different sklearn attributes, clf.score() and clf.cv_results_. I see that they give different results. Which one is correct? (However, the summary is not included here.)

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Load any n x m data and a label column. No missing or NaN values.
# I am skipping the data-loading part. One can load any data to test the code below.

sc = StandardScaler()
lr = LogisticRegression()
pipe = Pipeline(steps=[('sc', sc), ('lr', lr)])
parameters = {'lr__C': [0.001, 0.01]}

if __name__ == '__main__':

    clf = GridSearchCV(pipe, parameters, n_jobs=-1, cv=10, refit=True)

    X_train, X_test, y_train, y_test = train_test_split(Data, labels, random_state=0)

    # Train the classifier on the training features and targets
    clf.fit(X_train, y_train)

    print("Accuracy on training set: {:.2f}%\n".format(clf.score(X_train, y_train) * 100))
    print("Accuracy on test set: {:.2f}%\n".format(clf.score(X_test, y_test) * 100))
    print("Best Parameters: ")
    print(clf.best_params_)

    # Alternately, using cv_results_
    print("Mean CV accuracy on training folds: ", clf.cv_results_['mean_train_score'] * 100)
    print("Mean CV accuracy on validation folds: ", clf.cv_results_['mean_test_score'] * 100)

    # Predict class labels
    y_pred = clf.best_estimator_.predict(X_test)

    # Confusion Matrix
    class_names = ['Positive', 'Negative']
    confMatrix = confusion_matrix(y_test, y_pred)
    print(confMatrix)

    # Accuracy Report
    classificationReport = classification_report(y_test, y_pred, target_names=class_names)
    print(classificationReport)
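The code above never actually computes the stated goal, the mean 10-fold accuracy with its standard error. A minimal sketch of that calculation, assuming the bundled iris data as a stand-in for the unspecified Data/labels and using cross_val_score as a shortcut for collecting per-fold scores:

```python
# Sketch: mean 10-fold accuracy with its standard error.
# iris stands in for the asker's unspecified Data/labels.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One accuracy value per fold, computed on the training portion.
scores = cross_val_score(LogisticRegression(max_iter=200), X_train, y_train, cv=10)
mean_acc = scores.mean()
std_err = scores.std(ddof=1) / np.sqrt(len(scores))  # standard error of the mean
print("10-fold accuracy: {:.2f}% +/- {:.2f}%".format(mean_acc * 100, std_err * 100))
```

The same fold scores could equally be pulled out of clf.cv_results_ after a grid search; cross_val_score is just the shortest route to them for a single model.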

I would appreciate any suggestions.

1 Answer:

Answer 0 (score: 1):

  • First, the desired metric, i.e. accuracy, is already the default scorer of LogisticRegression(). Therefore, we may omit defining the scoring='accuracy' parameter of GridSearchCV().

  • Second, the score(X, y) method returns the value of the chosen metric IF the classifier has been refit with the best_estimator_ after sorting through all the possible options in param_grid. It works as it does because you provided refit=True. Note that clf.score(X, y) == clf.best_estimator_.score(X, y). Thus, it does not print out the averaged metric, but the best one.

  • Third, the cv_results_ attribute is a much broader summary, as it includes the results of every fit. However, it prints out the mean results obtained by averaging over the fold results. These are the values you wish to store.
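The second point can be checked directly. A small sketch (my own toy setup, not from the original answer; the breast-cancer data and parameter values are illustrative choices):

```python
# Sketch: with refit=True (the default), GridSearchCV.score delegates
# to best_estimator_.score, so the two calls agree.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)
clf = GridSearchCV(LogisticRegression(max_iter=1000),
                   {'C': [0.01, 1.0]}, cv=5, refit=True)
clf.fit(X, y)

# Both calls score the single refit best model, not an average across folds.
print(clf.score(X, y), clf.best_estimator_.score(X, y))

# The averaged numbers live in cv_results_: one mean per candidate C.
print(clf.cv_results_['mean_test_score'])
```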

Quick example

Let me introduce a toy example here for better understanding:

    from sklearn.datasets import load_digits
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.linear_model import LogisticRegression

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    param_grid = {'C': [0.001, 0.01]}
    clf = GridSearchCV(cv=10, estimator=LogisticRegression(), refit=True, param_grid=param_grid)
    clf.fit(X_train, y_train)
    clf.best_estimator_.score(X_train, y_train)
    print('____')
    clf.cv_results_

This code produces the following:

    0.98107957707289928  # this is the best accuracy score
    ____
    {'mean_fit_time': array([ 0.15465896,  0.23701136]),
     'mean_score_time': array([ 0.0006465 ,  0.00065773]),
     'mean_test_score': array([ 0.934335 ,  0.9376739 ]),
     'mean_train_score': array([ 0.96475625,  0.98225632]),
     'param_C': masked_array(data = [0.001 0.01], ...),
     'params': ({'C': 0.001}, {'C': 0.01}),
     ...}

mean_train_score has two mean values because I chose two options for the C parameter.
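Beyond the means, cv_results_ also stores each fold's individual score (split0_test_score, split1_test_score, ...), which is what you would aggregate to get the mean plus a standard error for the winning candidate. A sketch reusing the digits data (best_index_ and the splitK_test_score keys are standard GridSearchCV attributes; the max_iter value is my own choice):

```python
# Sketch: per-fold validation scores of the best candidate, from cv_results_.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_digits(return_X_y=True)
clf = GridSearchCV(estimator=LogisticRegression(max_iter=1000),
                   param_grid={'C': [0.001, 0.01]}, cv=10)
clf.fit(X, y)

# One validation score per fold for the best C, via the splitK_test_score keys.
fold_scores = np.array([clf.cv_results_['split%d_test_score' % k][clf.best_index_]
                        for k in range(10)])
mean_acc = fold_scores.mean()
std_err = fold_scores.std(ddof=1) / np.sqrt(len(fold_scores))
print("Best candidate: {:.2f}% +/- {:.2f}%".format(mean_acc * 100, std_err * 100))
```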

I hope that helps!