Question

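I'm still fairly new to machine learning and am trying to figure things out myself. I'm using scikit-learn and have a data set of tweets with around 20,000 features (n_features = 20,000). So far I have reached a precision, recall and F1 score of around 79%. I would like to use RFECV for feature selection to improve the performance of my model. I have read the scikit-learn documentation but am still a bit confused about how to use RFECV.

This is the code I have so far: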

```
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
# sklearn.cross_validation was removed; the same tools live in sklearn.model_selection
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import cross_val_score
from sklearn.feature_selection import RFECV
from sklearn import metrics

# cross validation
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
for train_index, test_index in sss.split(X, y):
    docs_train, docs_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

# feature extraction
# the built-in stop word list is named 'english' (lowercase)
count_vect = CountVectorizer(stop_words='english', min_df=3, max_df=0.90, ngram_range=(1, 3))
X_CV = count_vect.fit_transform(docs_train)

tfidf_transformer = TfidfTransformer()
X_tfidf = tfidf_transformer.fit_transform(X_CV)

# Create the RFECV object
# Note: MultinomialNB no longer exposes coef_ in recent scikit-learn releases,
# so RFECV may need importance_getter='feature_log_prob_' there.
nb = MultinomialNB(alpha=0.5)

# The "accuracy" scoring is proportional to the number of correct classifications
rfecv = RFECV(estimator=nb, step=1, cv=2, scoring='accuracy')

rfecv.fit(X_tfidf, y_train)
X_rfecv = rfecv.transform(X_tfidf)

print("Optimal number of features : %d" % rfecv.n_features_)

# train classifier
clf = MultinomialNB(alpha=0.5).fit(X_rfecv, y_train)

# test clf on test data
X_test_CV = count_vect.transform(docs_test)
X_test_tfidf = tfidf_transformer.transform(X_test_CV)
X_test_rfecv = rfecv.transform(X_test_tfidf)

y_predicted = clf.predict(X_test_rfecv)

# print the mean accuracy on the given test data and labels
# (clf.score, because X_test_rfecv is already reduced to the selected features)
print("Classifier score is: %s " % clf.score(X_test_rfecv, y_test))
```

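Three questions:

1) Is this the correct way to use cross-validation and RFECV? I am especially interested in whether I am running any risk of overfitting.

2) The accuracy of my model before and after implementing RFECV with the code above is almost the same (around 78-79%), which puzzles me. I would expect performance to improve by using RFECV. Is there anything I might have missed here, or something I could do differently to improve the performance of my model?

3) What other feature selection methods would you recommend I try? So far I have tried RFE and SelectKBest, but neither has given me any improvement in model accuracy.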

Answer 1

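To answer your questions:

RFECV feature selection has cross-validation built in (hence the name), so you do not need an additional cross-validation for that single step. However, since I understand you are running several tests, it is good to have an overall cross-validation to make sure you are not overfitting to one specific train-test split. I'd like to mention 2 points here:
1. I doubt the code behaves exactly the way you think it does ;).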
```
   # cross validation
   sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
   for train_index, test_index in sss.split(X, y):
       docs_train, docs_test = X[train_index], X[test_index]
       y_train, y_test = y[train_index], y[test_index]
   # feature extraction
   count_vect = CountVectorizer(stop_words='english', min_df=3, max_df=0.90, ngram_range=(1, 3))
   X_CV = count_vect.fit_transform(docs_train)
```
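
In the quoted snippet the loop body only assigns the index arrays, so once the loop finishes, `docs_train`, `docs_test`, `y_train` and `y_test` hold the fifth split and everything after the loop runs exactly once. A minimal sketch of one way to evaluate every split is to move the whole fit-and-score sequence inside the loop. The corpus from `make_toy_tweets` is made up for illustration, `evaluate_per_split` is a hypothetical helper name, and `LogisticRegression` is swapped in as the ranking estimator inside `RFECV` (an assumption, chosen because it exposes `coef_`); the final classifier is still `MultinomialNB`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def make_toy_tweets(n_docs=40, seed=0):
    """Build a tiny, made-up two-class corpus (hypothetical data, not the asker's tweets)."""
    rng = np.random.RandomState(seed)
    class_words = {0: ["bad", "awful", "sad", "hate"],
                   1: ["good", "great", "happy", "love"]}
    noise_words = ["today", "tweet", "phone", "movie"]
    docs, labels = [], []
    for i in range(n_docs):
        label = i % 2  # alternate labels so both classes are balanced
        words = list(rng.choice(class_words[label], 3)) + list(rng.choice(noise_words, 2))
        docs.append(" ".join(words))
        labels.append(label)
    return np.array(docs), np.array(labels)


def evaluate_per_split(docs, labels, n_splits=5):
    """Fit and score the whole pipeline once per shuffle split."""
    sss = StratifiedShuffleSplit(n_splits=n_splits, test_size=0.2, random_state=42)
    scores = []
    for train_index, test_index in sss.split(docs, labels):
        docs_train, docs_test = docs[train_index], docs[test_index]
        y_train, y_test = labels[train_index], labels[test_index]

        # Every step that learns from data is refit here, on the training part only.
        pipeline = Pipeline([
            ("counts", CountVectorizer()),
            ("tfidf", TfidfTransformer()),
            # Assumption: LogisticRegression ranks the features because it exposes coef_.
            ("rfecv", RFECV(estimator=LogisticRegression(), step=1, cv=2, scoring="accuracy")),
            ("clf", MultinomialNB(alpha=0.5)),
        ])
        pipeline.fit(docs_train, y_train)
        scores.append(pipeline.score(docs_test, y_test))
    return scores


docs, labels = make_toy_tweets()
scores = evaluate_per_split(docs, labels)
print("Accuracy per split:", scores)
print("Mean accuracy: %.3f" % np.mean(scores))
```

Wrapping the steps in a `Pipeline` means the vocabulary, the tf-idf weights and the selected feature set are all learned from the training part of each split only, so the averaged score is not tied to one particular train-test split.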

SciKit Learn feature selection and cross-validation using RFECV
