Question

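I am trying to understand how the best coefficients are calculated in a logistic regression cross-validation when the "refit" parameter is True. If I understand the docs correctly, the best coefficients are the result of first determining the best regularization parameter "C", i.e., the value of C that has the highest average score over all folds. The best coefficients are then simply the coefficients that were calculated on the fold with the highest score for the best C. I assume that if the maximum score is reached on several folds, the coefficients of those folds are averaged to give the best coefficients (I did not see anything in the docs on how this case is handled).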

To test my understanding, I determined the best coefficients in two different ways:

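1. directly from the coef_ attribute of the fitted model, and
2. from the coefs_paths_ attribute, which contains the path of the coefficients obtained during cross-validation across each fold and then across each C.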

The results I get from 1. and 2. are similar but not identical, so I was hoping someone could point out what I am doing wrong here. Thanks!

Example demonstrating the issue:

from sklearn.datasets import load_breast_cancer
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegressionCV 
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Set parameters
n_folds = 10
C_values = [0.001, 0.01, 0.05, 0.1, 1., 100.]

# Load and preprocess data
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
X_train_scaled = StandardScaler().fit_transform(X_train)

# Fit model
clf = LogisticRegressionCV(Cs=C_values, cv=n_folds, penalty='l1', 
                           refit=True, scoring='roc_auc', 
                           solver='liblinear', random_state=0,
                           fit_intercept=False)
clf.fit(X_train_scaled, y_train)

########################
# Get and plot coefficients using method 1
########################
coefs1 = clf.coef_
coefs1_series = pd.Series(coefs1.ravel(), index=cancer['feature_names'])
coefs1_series.sort_values().plot(kind="barh")

########################
# Get and plot coefficients using method 2
########################
# mean of scores of class "1"
scores = clf.scores_[1]
mean_scores = np.mean(scores, axis=0)
# Get index of the C that has the highest average score across all folds
best_C_idx = np.where(mean_scores==np.max(mean_scores))[0][0]
# Get index (here: indices) of the folds with highest scores for the 
# best C
best_folds_idx = np.where(scores[:, best_C_idx]==np.max(scores[:, best_C_idx]))[0]

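paths = clf.coefs_paths_[1]  # has shape (n_folds, len(C_values), n_features)
# Indexing with an array of folds keeps the fold axis, so no squeeze is needed
# (squeezing would break the mean below when only one fold has the top score)
coefs2 = paths[best_folds_idx, best_C_idx, :]
coefs2 = np.mean(coefs2, axis=0)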
coefs2_series = pd.Series(coefs2.ravel(), index=cancer['feature_names'])
coefs2_series.sort_values().plot(kind="barh")
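
One possible source of the mismatch: the scikit-learn docs describe refit=True as taking the best C and then doing "a final refit" with it. If that final refit runs on the whole training set rather than on a single fold, coef_ would not be expected to equal any entry of coefs_paths_. The following is a minimal sketch for checking that reading, not a statement of how the library is implemented. It assumes that a plain LogisticRegression with the selected C (read from clf.C_) and otherwise identical settings is a fair stand-in for the refit step; the names full_fit and max_abs_diff are introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Same data preparation as in the example above
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
X_train_scaled = StandardScaler().fit_transform(X_train)

# Same cross-validated model as in the example above
clf = LogisticRegressionCV(Cs=[0.001, 0.01, 0.05, 0.1, 1., 100.], cv=10,
                           penalty='l1', refit=True, scoring='roc_auc',
                           solver='liblinear', random_state=0,
                           fit_intercept=False)
clf.fit(X_train_scaled, y_train)

# Assumption being tested: refit=True fits once more on ALL training rows
# using the selected C, so fit a plain model the same way for comparison
best_C = clf.C_[0]
full_fit = LogisticRegression(C=best_C, penalty='l1', solver='liblinear',
                              random_state=0, fit_intercept=False)
full_fit.fit(X_train_scaled, y_train)

# Largest elementwise gap between the CV model and the full-data fit
max_abs_diff = np.max(np.abs(clf.coef_ - full_fit.coef_))
print("selected C:", best_C)
print("largest |coef_ difference| vs. full-data fit:", max_abs_diff)
```

If the printed difference is at or near zero, that would suggest coef_ comes from a fit on the full training set, which would explain why it is close to, but not the same as, the per-fold coefficients in coefs_paths_.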

scikit-learn LogisticRegressionCV: best coefficients

0 answers: