Python sklearn logistic regression K-fold cross-validation: how to create a dataframe for coef_

Asked: 2017-03-05 02:25:06

Tags: python scikit-learn logistic-regression cross-validation

Python 3.5

I have a dataset stored in the variable file, and I am trying to apply 10-fold cross-validation with logistic regression. What I'm looking for is a way to list the averages of the clf.coef_ values.

print(file.head())

   Result  Interest  Limit  Service  Convenience  Trust  Speed 
0       0         1      1        1            1      1      1   
1       0         1      1        1            1      1      1   
2       0         1      1        1            1      1      1   
3       0         4      4        3            4      2      3   
4       1         4      4        4            4      4      4 

Below is the simple logistic regression code I wrote to display the list of coef_.

[IN]

import pandas as pd
from pandas import DataFrame
import numpy as np
from sklearn import cross_validation
from sklearn.linear_model import LogisticRegression

X = file.drop(['Result'],1)
y = file['Result']

X_train, X_test, y_train, y_test = cross_validation.train_test_split(X,y,test_size=0.25)
clf = LogisticRegression(penalty='l1')
clf.fit(X_train,y_train)
accuracy = clf.score(X_test,y_test)
print(accuracy)

coeff_df = pd.DataFrame([X.columns, clf.coef_[0]]).T
print(coeff_df)

[OUT]

0.823061630219  

             0          1
0     Interest   0.163577
1        Limit  -0.161104
2      Service   0.323073
3  Convenience   0.121573
4        Trust   0.370012
5        Speed   0.089934
6        Major   0.183002
7          Ads  0.0137151

Then I tried to apply 10-fold cross-validation to the same dataset. I have the code below, but I cannot produce a DataFrame of coef_ (coeff_df) like in my analysis above. Can anyone offer a solution?

[IN]

from sklearn.cross_validation import cross_val_score
scores = cross_val_score(clf, X, y, cv=10)
print (scores)
print (np.average(scores))

[OUT]

[ 0.82178218  0.7970297   0.84158416  0.80693069  0.84158416  0.80693069
  0.825       0.825       0.815       0.76      ]
0.814084158416

1 Answer:

Answer 0 (score: 2)

cross_val_score is a helper function that wraps scikit-learn's various cross-validation objects (e.g. KFold, StratifiedKFold). It returns a list of scores based on the scoring parameter used (for classification problems, I believe it defaults to accuracy).
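To illustrate the scoring parameter, here is a minimal sketch on synthetic data from make_classification (the question's dataset is not available here, so the feature values are made up):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; the question's dataset is not available here
X, y = make_classification(n_samples=100, n_features=6, random_state=0)
clf = LogisticRegression(solver='liblinear')

# A classifier's default scorer is its accuracy, but it can be set explicitly
acc = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
f1 = cross_val_score(clf, X, y, cv=5, scoring='f1')
print(acc.mean(), f1.mean())
```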

The object returned by cross_val_score does not give you access to the underlying folds/models used in the cross-validation, which means you cannot get the coefficients of each model.
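As a side note: in newer scikit-learn versions (0.20+), cross_validate with return_estimator=True does hand back the fitted models, so per-fold coefficients can be collected without writing the loop by hand. A sketch, again on synthetic stand-in data:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the question's data
X, y = make_classification(n_samples=100, n_features=6, random_state=0)
X = pd.DataFrame(X, columns=['Interest', 'Limit', 'Service',
                             'Convenience', 'Trust', 'Speed'])

# liblinear is a solver that supports the l1 penalty
clf = LogisticRegression(penalty='l1', solver='liblinear')
cv_results = cross_validate(clf, X, y, cv=10, return_estimator=True)

# One row of coefficients per fold, then the column-wise mean
coefs = [est.coef_[0] for est in cv_results['estimator']]
mean_coefs = pd.DataFrame(coefs, columns=X.columns).mean()
print(mean_coefs)
```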

To get the coefficients for each cross-validation fold, you need to use KFold yourself (or, if your classes are imbalanced, StratifiedKFold).

import pandas as pd
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression

df = pd.read_clipboard()  # the sample rows pasted from the question
file = pd.concat([df, df, df]).reset_index(drop=True)  # repeated so each class has enough rows

X = file.drop(['Result'],1)
y = file['Result']

skf = StratifiedKFold(n_splits=2)  # random_state only matters with shuffle=True

models, coefs = [], []  # in case you want to inspect the models later, too
for train, test in skf.split(X, y):
    print(train, test)
    clf = LogisticRegression(penalty='l1', solver='liblinear')  # liblinear supports l1
    clf.fit(X.loc[train], y.loc[train])
    models.append(clf)
    coefs.append(clf.coef_[0])

pd.DataFrame(coefs, columns=X.columns).mean()

which gets us:

Interest       0.000000
Limit          0.000000
Service        0.000000
Convenience    0.000000
Trust          0.530811
Speed          0.000000
dtype: float64

I had to make up data from your example (which contains only one instance of the positive class). I suspect the numbers won't be zero in your case.

Edit: Since StratifiedKFold (or KFold) gives us the cross-validation splits of the dataset, you can still compute the cross-validation scores using the model's score method.

The version below is changed slightly to also capture the cross-validation score for each fold.

models, scores, coefs = [], [], []  # in case you want to inspect the models later, too
for train, test in skf.split(X, y):
    print(train, test)
    clf = LogisticRegression(penalty='l1', solver='liblinear')  # liblinear supports l1
    clf.fit(X.loc[train], y.loc[train])
    score = clf.score(X.loc[test], y.loc[test])
    models.append(clf)
    scores.append(score)
    coefs.append(clf.coef_[0])
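Putting it all together on synthetic stand-in data (a sketch; .iloc is used instead of .loc so the positional fold indices work even when the DataFrame index isn't a clean 0..n range):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the question's data
X, y = make_classification(n_samples=100, n_features=6, random_state=0)
X = pd.DataFrame(X, columns=['Interest', 'Limit', 'Service',
                             'Convenience', 'Trust', 'Speed'])
y = pd.Series(y)

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

scores, coefs = [], []
for train, test in skf.split(X, y):
    clf = LogisticRegression(penalty='l1', solver='liblinear')
    clf.fit(X.iloc[train], y.iloc[train])  # .iloc: the fold indices are positional
    scores.append(clf.score(X.iloc[test], y.iloc[test]))
    coefs.append(clf.coef_[0])

coeff_df = pd.DataFrame(coefs, columns=X.columns)
print(coeff_df.mean())   # average coefficient per feature across folds
print(np.mean(scores))   # average cross-validation accuracy
```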