scikit-learn LogisticRegressionCV: best coefficients

Time: 2018-03-29 17:32:43

Tags: scikit-learn logistic-regression cross-validation coefficients

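I am trying to understand how the best coefficients are computed in a cross-validated logistic regression when the "refit" parameter is True. If I understand the docs correctly, the best coefficients result from first determining the best regularization parameter "C", i.e., the value of C with the highest average score over all folds. The best coefficients are then simply the coefficients computed on the fold that scores highest for the best C. I assume that if several folds reach the maximum score, the coefficients of those folds are averaged to give the best coefficients (I did not see in the docs how this case is handled).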

To test my understanding, I determined the best coefficients in two different ways:

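  1. directly from the coef_ attribute of the fitted model, and
  2. from the coefs_paths_ attribute, which contains the path of coefficients obtained during cross-validation, across each fold and then across each C.

The results I get from 1. and 2. are similar but not identical, so I hope someone can point out what I am doing wrong here. Thanks!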

An example that demonstrates the problem:

    from sklearn.datasets import load_breast_cancer
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegressionCV 
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import train_test_split
    
    # Set parameters
    n_folds = 10
    C_values = [0.001, 0.01, 0.05, 0.1, 1., 100.]
    
    # Load and preprocess data
    cancer = load_breast_cancer()
    X, y = cancer.data, cancer.target
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    X_train_scaled = StandardScaler().fit_transform(X_train)
    
    # Fit model
    clf = LogisticRegressionCV(Cs=C_values, cv=n_folds, penalty='l1', 
                               refit=True, scoring='roc_auc', 
                               solver='liblinear', random_state=0,
                               fit_intercept=False)
    clf.fit(X_train_scaled, y_train)
    
    ########################
    # Get and plot coefficients using method 1
    ########################
    coefs1 = clf.coef_
    coefs1_series = pd.Series(coefs1.ravel(), index=cancer['feature_names'])
    coefs1_series.sort_values().plot(kind="barh")
    
    ########################
    # Get and plot coefficients using method 2
    ########################
    # mean of scores of class "1"
    scores = clf.scores_[1]
    mean_scores = np.mean(scores, axis=0)
    # Get index of the C that has the highest average score across all folds
    best_C_idx = np.where(mean_scores==np.max(mean_scores))[0][0]
    # Get index (here: indices) of the folds with highest scores for the 
    # best C
    best_folds_idx = np.where(scores[:, best_C_idx]==np.max(scores[:, best_C_idx]))[0]
    
    paths = clf.coefs_paths_[1]  # has shape (n_folds, len(C_values), n_features)
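    # Indexing with the array of best folds already yields shape
    # (n_best_folds, n_features); squeezing first would break the
    # axis-0 mean when only a single fold attains the maximum
    coefs2 = paths[best_folds_idx, best_C_idx, :]
    coefs2 = np.mean(coefs2, axis=0)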
    coefs2_series = pd.Series(coefs2.ravel(), index=cancer['feature_names'])
    coefs2_series.sort_values().plot(kind="barh")
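
A further check that may help narrow this down: the scikit-learn documentation for refit describes a final refit using the best C once the scores have been averaged across folds. If that reading is right, coef_ would come from a fit on the whole training set rather than from any single fold, which would explain why it only roughly matches per-fold values taken from coefs_paths_. The sketch below tests that reading by comparing coef_ with a plain LogisticRegression fitted on all the data at the selected C. The names shared, cv_clf and plain_clf, the shorter list of C values, and the use of 5 folds on the unsplit data are assumptions made to keep the example small; they are not part of the original code.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

# Same data set as in the question, without the train/test split (assumption:
# the split does not matter for checking where coef_ comes from).
X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Settings shared by both estimators so that only the fitting procedure differs.
shared = dict(penalty='l1', solver='liblinear',
              fit_intercept=False, random_state=0)

# Cross-validated model with refit=True, as in the question.
cv_clf = LogisticRegressionCV(Cs=[0.01, 0.1, 1.0], cv=5, refit=True,
                              scoring='roc_auc', **shared)
cv_clf.fit(X_scaled, y)

# C selected by cross-validation (np.ravel handles scalar or array C_).
best_C = float(np.ravel(cv_clf.C_)[0])

# Plain model fitted on ALL the data with that C, no cross-validation.
plain_clf = LogisticRegression(C=best_C, **shared)
plain_clf.fit(X_scaled, y)

# A difference near zero would support the "final refit on all data" reading.
max_abs_diff = np.max(np.abs(cv_clf.coef_ - plain_clf.coef_))
print("best C:", best_C)
print("max |coef_ difference| vs. full-data fit:", max_abs_diff)
```

If the printed difference is close to zero, coef_ is the result of a refit on the full data. In that case no fold in coefs_paths_ would be expected to reproduce it exactly, since every fold is fitted on only part of the data.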
    

0 answers:

No answers