我正在使用scikit中的各种机制 - 学习创建训练数据集的tf-idf表示和由文本特征组成的测试集。两个数据集都经过预处理以使用相同的词汇表,因此功能和功能数量相同。我可以在训练数据上创建模型并评估其在测试数据上的表现。我想知道我是否使用SelectPercentile来减少转换后训练集中的特征数量,如何识别测试集中用于预测的相同特征?
trainDenseData = trainTransformedData.toarray()
testDenseData = testTransformedData.toarray()
if ( useFeatureReduction== True):
reducedTrainData = SelectPercentile(f_regression,percentile=10).fit_transform(trainDenseData,trainYarray)
clf.fit(reducedTrainData, trainYarray)
# apply feature reduction to the test data
答案 0 :(得分:1)
请参阅下面的代码和评论。
import numpy as np
from sklearn.datasets import make_classification
from sklearn import feature_selection
# Build a classification task using 3 informative features
X, y = make_classification(n_samples=1000,
n_features=10,
n_informative=3,
n_redundant=0,
n_repeated=0,
n_classes=2,
random_state=0,
shuffle=False)
sp = feature_selection.SelectPercentile(feature_selection.f_regression, percentile=30)
sp.fit_transform(X[:-1], y[:-1]) #here, training are the first 9 data vectors, and the last one is the test set
idx = np.arange(0, X.shape[1]) #create an index array
features_to_keep = idx[sp.get_support() == True] #get index positions of kept features
x_fs = X[:,features_to_keep] #prune X data vectors
x_test_fs = x_fs[-1] #take your last data vector (the test set) pruned values
print x_test_fs #these are your pruned test set values
答案 1 :(得分:1)
您应该存储SelectPercentile
对象,并将其用于transform
测试数据:
select = SelectPercentile(f_regression,percentile=10)
reducedTrainData = select.fit_transform(trainDenseData,trainYarray)
reducedTestData = select.transform(testDenseData)