Precomputing CountVectorizer

Time: 2016-07-21 15:29:16

Tags: python python-3.x machine-learning scikit-learn text-classification

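This is more of a question about good practice. I have a large amount of labelled text data, and I am trying to train a binary classifier. I would like to optimize the hyperparameters in my pipeline, but the vectorization step takes a lot of time. Is it acceptable to vectorize the text in the training set before running cross-validation, so that I am not re-vectorizing the same training texts over and over?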

Here is a snippet of my code:

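from nltk.tokenize import RegexpTokenizer
from scipy import stats
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer

# Keep alphabetic tokens of two or more letters ('|' inside [...] would match a literal pipe)
tokenizer = RegexpTokenizer(r'\b[a-zA-Z]{2,}\b')
vect = CountVectorizer(tokenizer=tokenizer.tokenize, ngram_range=(1, 2))

# Vectorize once, outside the cross-validation loop
X_train_vect = vect.fit_transform(X_train)

# The 'vect' step here is the TF-IDF reweighting of the precomputed counts
pipe = Pipeline([('vect', TfidfTransformer()),
                 ('norm', Normalizer()),
                 ('clf', SGDClassifier())])

# Current scikit-learn: n_splits replaces n_folds, and y is passed at fit time
skf = StratifiedKFold(n_splits=5, shuffle=True)

params = {'vect__norm': ['l1', 'l2'],
          'vect__use_idf': [True, False],
          'clf__loss': ['hinge', 'log_loss', 'modified_huber'],  # 'log' before scikit-learn 1.1
          'clf__alpha': stats.expon(scale=0.0001)}

gs = RandomizedSearchCV(estimator=pipe,
                        param_distributions=params,
                        n_iter=50,
                        scoring='accuracy',
                        verbose=3,
                        cv=skf)

gs.fit(X_train_vect, y_train)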

I also tried including CountVectorizer in the pipeline and did not see much of a difference. Any thoughts/suggestions?
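For comparison, the variant with CountVectorizer inside the pipeline could be sketched roughly as follows. This is a minimal illustration, not the original setup: the toy corpus, the step names `counts` and `tfidf`, the `token_pattern` regex standing in for the NLTK tokenizer, the fixed `random_state` values, and the reduced `n_iter` are all assumptions made so the example runs on its own.

```python
from scipy import stats
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Toy corpus standing in for the real labelled data (assumption).
X_train = ["good movie", "great film", "bad movie", "awful film"] * 5
y_train = [1, 1, 0, 0] * 5

pipe = Pipeline([
    # token_pattern approximates the RegexpTokenizer from the question (assumption).
    ("counts", CountVectorizer(token_pattern=r"\b[a-zA-Z]{2,}\b",
                               ngram_range=(1, 2))),
    ("tfidf", TfidfTransformer()),
    ("clf", SGDClassifier(random_state=0)),
])

# Only a subset of the original search space, to keep the sketch short.
params = {
    "tfidf__use_idf": [True, False],
    "clf__alpha": stats.expon(scale=0.0001),
}

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

gs = RandomizedSearchCV(pipe, param_distributions=params, n_iter=4,
                        scoring="accuracy", cv=skf, random_state=0)

# Raw text goes in; the vectorizer is refit on the training part of every fold.
gs.fit(X_train, y_train)
```

In this form the vocabulary is learned only from the training portion of each fold, whereas the precomputed version above learns it once from the whole training set before the folds are split.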

0 answers:

No answers