This is more of a question about good practice. I have a large amount of labeled text data and I'm trying to train a binary classifier. I want to tune the hyperparameters in my pipeline, but the vectorization step takes a long time. Is it acceptable to vectorize the text in my training set before cross-validation, so that I'm not vectorizing the same training text over and over again?
Here is a snippet of my code:
from scipy import stats
from nltk.tokenize import RegexpTokenizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.preprocessing import Normalizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.cross_validation import StratifiedKFold
from sklearn.grid_search import RandomizedSearchCV

# Tokenize on runs of 2+ letters (the character class should not contain '|')
tokenizer = RegexpTokenizer(r'\b[a-zA-Z]{2,}\b')
vect = CountVectorizer(tokenizer=tokenizer.tokenize, ngram_range=(1, 2))

# Vectorize the training text once, outside of the cross-validation loop
X_train_vect = vect.fit_transform(X_train)

pipe = Pipeline([('vect', TfidfTransformer()),
                 ('norm', Normalizer()),
                 ('clf', SGDClassifier())])
skf = StratifiedKFold(y_train, n_folds=5, shuffle=True)

params = {'vect__norm': ['l1', 'l2'],
          'vect__use_idf': [True, False],
          'clf__loss': ['hinge', 'log', 'modified_huber'],
          'clf__alpha': stats.expon(scale=0.0001)}

gs = RandomizedSearchCV(estimator=pipe,
                        param_distributions=params,
                        n_iter=50,
                        scoring='accuracy',
                        verbose=3,
                        cv=skf)
gs.fit(X_train_vect, y_train)
I have also tried including the CountVectorizer inside the pipeline (see the sketch below) and didn't see much of a difference. Any thoughts/suggestions?
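For reference, the in-pipeline variant I tried looked roughly like the sketch below (step names are illustrative, not copied from the snippet above; it reuses tokenizer, skf and the classifier parameters already defined, and is fit on the raw X_train text so the CountVectorizer gets re-fit on every fold):

# Sketch of the all-in-one variant: vectorization happens inside each CV fold
pipe_full = Pipeline([('count', CountVectorizer(tokenizer=tokenizer.tokenize,
                                                ngram_range=(1, 2))),
                      ('tfidf', TfidfTransformer()),
                      ('norm', Normalizer()),
                      ('clf', SGDClassifier())])
params_full = {'tfidf__norm': ['l1', 'l2'],
               'tfidf__use_idf': [True, False],
               'clf__loss': ['hinge', 'log', 'modified_huber'],
               'clf__alpha': stats.expon(scale=0.0001)}
gs_full = RandomizedSearchCV(estimator=pipe_full,
                             param_distributions=params_full,
                             n_iter=50,
                             scoring='accuracy',
                             verbose=3,
                             cv=skf)
gs_full.fit(X_train, y_train)   # raw text in; CountVectorizer re-fit per fold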