Scikit-learn: RFECV number of features based only on grid scores

Date: 2016-05-05 15:55:57

Tags: algorithm python-2.7 machine-learning scikit-learn

From the scikit-learn RFE documentation: the algorithm successively selects smaller and smaller sets of features, keeping only the features with the highest weights. Features with low weights are dropped, and this process is repeated until the number of remaining features matches the number specified by the user (or, by default, half of the original number of features).
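As a minimal, self-contained sketch of that behaviour (the toy data and names such as X_demo below are purely illustrative), RFE keeps exactly the requested number of features when n_features_to_select is set, and roughly half of the original features when it is left at its default:

from sklearn.svm import SVC
from sklearn.feature_selection import RFE
from sklearn.datasets import make_classification

# Toy data: 10 features, 3 of them informative
X_demo, y_demo = make_classification(n_samples=200, n_features=10,
                                     n_informative=3, random_state=0)
svc_demo = SVC(kernel="linear")

# User-specified target: keep exactly 4 features
print(RFE(svc_demo, n_features_to_select=4, step=1).fit(X_demo, y_demo).n_features_)  # 4
# Default target: keep half of the original 10 features
print(RFE(svc_demo, step=1).fit(X_demo, y_demo).n_features_)  # 5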

The RFECV docs state that the features are ranked with RFE and KFCV (k-fold cross-validation).

We have a set of 25 features in the code shown in the documentation example for RFECV:

from sklearn.svm import SVC
from sklearn.cross_validation import StratifiedKFold
from sklearn.feature_selection import RFECV, RFE
from sklearn.datasets import make_classification

# Build a classification task using 3 informative features
X, y = make_classification(n_samples=1000, n_features=25, n_informative=3,
                           n_redundant=2, n_repeated=0, n_classes=8,
                           n_clusters_per_class=1, random_state=0)

# Create the RFE object and compute a cross-validated score.
svc = SVC(kernel="linear")
# The "accuracy" scoring is proportional to the number of correct
# classifications
rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(y, 2), scoring='accuracy')
rfecv.fit(X, y)
rfe = RFE(estimator=svc, step=1)
rfe.fit(X, y)

print('Original number of features is %s' % X.shape[1])
print("RFE final number of features : %d" % rfe.n_features_)
print("RFECV final number of features : %d" % rfecv.n_features_)
print('')

import numpy as np
g_scores = rfecv.grid_scores_
indices = np.argsort(g_scores)[::-1]
print('Printing RFECV results:')
for f in range(X.shape[1]):
    print("%d. Number of features: %d;
                  Grid_Score: %f" % (f + 1, indices[f]+1, g_scores[indices[f]]))

Here is the output I get:

Original number of features is 25
RFE final number of features : 12
RFECV final number of features : 3

Printing RFECV results:
1. Number of features: 3; Grid_Score: 0.818041
2. Number of features: 4; Grid_Score: 0.816065
3. Number of features: 5; Grid_Score: 0.816053
4. Number of features: 6; Grid_Score: 0.799107
5. Number of features: 7; Grid_Score: 0.797047
6. Number of features: 8; Grid_Score: 0.783034
7. Number of features: 10; Grid_Score: 0.783022
8. Number of features: 9; Grid_Score: 0.781992
9. Number of features: 11; Grid_Score: 0.778028
10. Number of features: 12; Grid_Score: 0.774052
11. Number of features: 14; Grid_Score: 0.762015
12. Number of features: 13; Grid_Score: 0.760075
13. Number of features: 15; Grid_Score: 0.752003
14. Number of features: 16; Grid_Score: 0.750015
15. Number of features: 18; Grid_Score: 0.750003
16. Number of features: 22; Grid_Score: 0.748039
17. Number of features: 17; Grid_Score: 0.746003
18. Number of features: 19; Grid_Score: 0.739105
19. Number of features: 20; Grid_Score: 0.739021
20. Number of features: 21; Grid_Score: 0.738003
21. Number of features: 23; Grid_Score: 0.729068
22. Number of features: 25; Grid_Score: 0.725056
23. Number of features: 24; Grid_Score: 0.725044
24. Number of features: 2; Grid_Score: 0.506952
25. Number of features: 1; Grid_Score: 0.272896

In this particular example:

  1. For RFE: the code always returns 12 features (roughly half of 25, as expected from the documentation).
  2. For RFECV: the code returns a different number between 1 and 25 (not half of the number of features).
  3. It looks to me as though RFECV picks the number of features based only on the KFCV scores - i.e. the cross-validation scores override RFE's successive pruning of features (see the sketch after these questions).

    Is this true? If one wants to use the native recursive feature elimination algorithm, does RFECV use that algorithm, or some hybrid version of it?

    In RFECV, is cross-validation performed on the subset of features that remains after pruning? If so, how many features are kept after each pruning step in RFECV?
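
For what it's worth, point 3 seems consistent with the following check (reusing the rfecv object fitted above; this is just my reading of the behaviour with step=1, not a statement about scikit-learn's internals):

import numpy as np

# With step=1, grid_scores_[i] appears to be the mean CV score with (i + 1) features kept
best_k = np.argmax(rfecv.grid_scores_) + 1
print(best_k)              # 3 for the run above
print(rfecv.n_features_)   # also 3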

1 answer:

Answer 0 (score: 2):

In the cross-validated version, the features are re-ranked at each step and the lowest-ranked feature is dropped - this is what the documentation calls recursive feature elimination.
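
As a rough, hand-rolled sketch of that elimination loop (illustrative only: the toy data, the names, and the squared-weight ranking below mimic what RFE does with a linear SVC, not its exact internals):

import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X_cur, y_cur = make_classification(n_samples=200, n_features=10,
                                   n_informative=3, random_state=0)
remaining = list(range(X_cur.shape[1]))  # indices of the surviving features

while len(remaining) > 5:                # stop at some target count
    svc_step = SVC(kernel="linear").fit(X_cur[:, remaining], y_cur)
    weights = (svc_step.coef_ ** 2).sum(axis=0)  # re-rank the surviving features
    worst = int(np.argmin(weights))              # the lowest-ranked feature...
    del remaining[worst]                         # ...is dropped, then repeat

print(remaining)  # indices of the features that survive the elimination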

If you want to compare this with the naive version, you need to compute the cross-validated score for the features selected by RFE. My guess is that the RFECV answer is correct - judging by the sharp increase in model performance as features are dropped, you probably have some highly correlated features that are hurting your model's performance.
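
One simple way to run that comparison (slightly optimistic, since the feature selection itself was done on the full data set) is to cross-validate the SVC on each selected subset, reusing X, y, svc and the fitted rfe/rfecv objects from the question's code:

from sklearn.cross_validation import cross_val_score, StratifiedKFold

cv = StratifiedKFold(y, 2)

# CV accuracy on the 12 features kept by RFE
rfe_scores = cross_val_score(svc, rfe.transform(X), y, cv=cv, scoring='accuracy')
# CV accuracy on the 3 features kept by RFECV
rfecv_scores = cross_val_score(svc, rfecv.transform(X), y, cv=cv, scoring='accuracy')

print("RFE subset   : %.3f" % rfe_scores.mean())
print("RFECV subset : %.3f" % rfecv_scores.mean())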