For a machine learning experiment I have to perform feature selection. Since I'm using 10-fold cross-validation, I don't have a train/test split. I've been told that I have to do the feature selection per fold, but I don't know how to do that. Here is a part of my code.
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler, LabelEncoder

vec = DictVectorizer()
X = vec.fit_transform(instances) # No train/test set, because we'll use 10-fold cross validation
scaler = StandardScaler(with_mean=False)
X_scaled = scaler.fit_transform(X) # To make sure everything is on the same scale
enc = LabelEncoder()
y = enc.fit_transform(labels)
#feature selection
from sklearn.feature_selection import SelectKBest, mutual_info_classif
feat_sel = SelectKBest(mutual_info_classif, k=200)
X_fs = feat_sel.fit_transform(X_scaled, y)
#train a classifier
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
y_pred = model_selection.cross_val_predict(clf, X_fs, y, cv=10)
Can someone help me do the feature selection per fold?
Answer 0 (score: 1)
You can use a Pipeline: join the feature selector and the classifier into a pipeline and cross-validate the pipeline itself, so the selector is refit on the training folds of each split.
Reference: http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
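A minimal sketch of that idea, reusing X_scaled, y, and the k=200 SelectKBest from your code, and keeping your cross_val_predict call (the step names 'select' and 'clf' are arbitrary):

from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn import model_selection
from sklearn.metrics import classification_report

# The selector is a pipeline step, so it is fit only on the training
# folds of each CV split instead of on the whole data set.
pipe = Pipeline([
    ('select', SelectKBest(mutual_info_classif, k=200)),
    ('clf', MultinomialNB()),
])
y_pred = model_selection.cross_val_predict(pipe, X_scaled, y, cv=10)
print(classification_report(y, y_pred))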
Answer 1 (score: 1)
To answer the second question you posted: you can use cross-validation and look at the results.
Run:
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
import numpy as np

# Chaining the selector and the classifier in a Pipeline means
# SelectKBest is re-fit on the training folds of every CV split.
feat_sel = SelectKBest(mutual_info_classif, k=200)
clf = MultinomialNB()
pipe = Pipeline([('mutual_info', feat_sel), ('naive_bayes', clf)])

# X_scaled and y come from the preprocessing code in the question.
scores = cross_val_score(pipe, X_scaled, y, cv=10, scoring='accuracy')
print(np.mean(scores))
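If you also want to inspect which features were selected in each fold, one option is cross_validate with return_estimator=True. This is a sketch assuming the pipe, X_scaled, and y defined above, and a scikit-learn version that supports return_estimator (0.20 or later):

from sklearn.model_selection import cross_validate

cv_results = cross_validate(pipe, X_scaled, y, cv=10, return_estimator=True)
for i, fitted_pipe in enumerate(cv_results['estimator']):
    # Column indices of the k features chosen on this fold's training data
    selected = fitted_pipe.named_steps['mutual_info'].get_support(indices=True)
    print('Fold', i, ':', selected)

The selected indices refer to columns of X_scaled, so they can differ from fold to fold, which is exactly the per-fold behaviour you were asked to implement.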