Python sklearn logistic regression K-fold cross-validation: how to create a dataframe for coef_

Asked: 2017-03-05 02:25:06

Tags: python scikit-learn logistic-regression cross-validation

Python 3.5

I have a dataset stored in the variable file, and I am trying to apply 10-fold cross-validation with logistic regression. What I'm looking for is a way to list the averages of the clf.coef_ values.

print(file.head())

   Result  Interest  Limit  Service  Convenience  Trust  Speed 
0       0         1      1        1            1      1      1   
1       0         1      1        1            1      1      1   
2       0         1      1        1            1      1      1   
3       0         4      4        3            4      2      3   
4       1         4      4        4            4      4      4 

Below is the simple logistic regression code I wrote to display the list of coef_.

[IN]

import pandas as pd
from pandas import DataFrame
import numpy as np
from sklearn import cross_validation
from sklearn.linear_model import LogisticRegression

X = file.drop(['Result'],1)
y = file['Result']

X_train, X_test, y_train, y_test = cross_validation.train_test_split(X,y,test_size=0.25)
clf = LogisticRegression(penalty='l1')
clf.fit(X_train,y_train)
accuracy = clf.score(X_test,y_test)
print(accuracy)

coeff_df = pd.DataFrame([X.columns, clf.coef_[0]]).T
print(coeff_df)

[OUT]

0.823061630219  

             0          1
0     Interest   0.163577
1        Limit  -0.161104
2      Service   0.323073
3  Convenience   0.121573
4        Trust   0.370012
5        Speed   0.089934
6        Major   0.183002
7          Ads  0.0137151

Then I tried to apply 10-fold cross-validation to the same dataset. I have the code below, but I cannot produce a DataFrame of coef_ (coeff_df) like in my analysis above. Can anyone offer a solution?

[IN]

from sklearn.cross_validation import cross_val_score
scores = cross_val_score(clf, X, y, cv=10)
print (scores)
print (np.average(scores))

[OUT]

[ 0.82178218  0.7970297   0.84158416  0.80693069  0.84158416  0.80693069
  0.825       0.825       0.815       0.76      ]
0.814084158416

1 Answer:

Answer 0 (score: 2)

cross_val_score is a helper function that wraps scikit-learn's various cross-validation objects (e.g. KFold, StratifiedKFold). It returns a list of scores based on the scoring parameter used (for classification problems, I believe it defaults to accuracy).
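To illustrate the scoring parameter, here is a minimal sketch on synthetic data from make_classification (the question's dataset is not available here, so the feature values are made up):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; the question's dataset is not available here
X, y = make_classification(n_samples=100, n_features=6, random_state=0)
clf = LogisticRegression(solver='liblinear')

# A classifier's default scorer is its accuracy, but it can be set explicitly
acc = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
f1 = cross_val_score(clf, X, y, cv=5, scoring='f1')
print(acc.mean(), f1.mean())
```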

The object returned by cross_val_score does not give you access to the underlying folds/models used in the cross-validation, which means you cannot get the coefficients of each model.
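As a side note: in newer scikit-learn versions (0.20+), cross_validate with return_estimator=True does hand back the fitted models, so per-fold coefficients can be collected without writing the loop by hand. A sketch, again on synthetic stand-in data:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the question's data
X, y = make_classification(n_samples=100, n_features=6, random_state=0)
X = pd.DataFrame(X, columns=['Interest', 'Limit', 'Service',
                             'Convenience', 'Trust', 'Speed'])

# liblinear is a solver that supports the l1 penalty
clf = LogisticRegression(penalty='l1', solver='liblinear')
cv_results = cross_validate(clf, X, y, cv=10, return_estimator=True)

# One row of coefficients per fold, then the column-wise mean
coefs = [est.coef_[0] for est in cv_results['estimator']]
mean_coefs = pd.DataFrame(coefs, columns=X.columns).mean()
print(mean_coefs)
```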

To get the coefficients for each cross-validation fold, you need to use KFold yourself (or, if your classes are imbalanced, StratifiedKFold).

import pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression

df = pd.read_clipboard()  # the sample rows pasted from the question
file = pd.concat([df, df, df]).reset_index(drop=True)  # repeated so each class has enough rows

X = file.drop(['Result'],1)
y = file['Result']

skf = StratifiedKFold(n_splits=2)  # random_state only matters with shuffle=True

models, coefs = [], []  # in case you want to inspect the models later, too
for train, test in skf.split(X, y):
    print(train, test)
    clf = LogisticRegression(penalty='l1', solver='liblinear')  # liblinear supports l1
    clf.fit(X.loc[train], y.loc[train])
    models.append(clf)
    coefs.append(clf.coef_[0])

pd.DataFrame(coefs, columns=X.columns).mean()

which gets us:

Interest       0.000000
Limit          0.000000
Service        0.000000
Convenience    0.000000
Trust          0.530811
Speed          0.000000
dtype: float64

I had to make up data from your example (which contains only one instance of the positive class). I suspect the numbers won't be zero in your case.

Edit: Since StratifiedKFold (or KFold) gives us the cross-validation splits of the dataset, you can still compute the cross-validation scores using the model's score method.

The version below is changed slightly to also capture the cross-validation score for each fold.

models, scores, coefs = [], [], []  # in case you want to inspect the models later, too
for train, test in skf.split(X, y):
    print(train, test)
    clf = LogisticRegression(penalty='l1', solver='liblinear')  # liblinear supports l1
    clf.fit(X.loc[train], y.loc[train])
    score = clf.score(X.loc[test], y.loc[test])
    models.append(clf)
    scores.append(score)
    coefs.append(clf.coef_[0])
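Putting it all together on synthetic stand-in data (a sketch; .iloc is used instead of .loc so the positional fold indices work even when the DataFrame index isn't a clean 0..n range):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the question's data
X, y = make_classification(n_samples=100, n_features=6, random_state=0)
X = pd.DataFrame(X, columns=['Interest', 'Limit', 'Service',
                             'Convenience', 'Trust', 'Speed'])
y = pd.Series(y)

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

scores, coefs = [], []
for train, test in skf.split(X, y):
    clf = LogisticRegression(penalty='l1', solver='liblinear')
    clf.fit(X.iloc[train], y.iloc[train])  # .iloc: the fold indices are positional
    scores.append(clf.score(X.iloc[test], y.iloc[test]))
    coefs.append(clf.coef_[0])

coeff_df = pd.DataFrame(coefs, columns=X.columns)
print(coeff_df.mean())   # average coefficient per feature across folds
print(np.mean(scores))   # average cross-validation accuracy
```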