How to use a pickled classifier with CountVectorizer.fit_transform() to label data

Time: 2014-09-23 21:02:24

Tags: python scikit-learn text-classification

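I trained a classifier on a set of short documents and pickled it after getting reasonable F1 and accuracy scores on a binary classification task.

During training, I reduced the number of features using a scikit-learn CountVectorizer, cv: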

    cv = CountVectorizer(min_df=1, ngram_range=(1, 3), max_features = 15000) 

I then used the fit_transform() and transform() methods to obtain the transformed training and test sets:

    transformed_feat_train = numpy.zeros((0,0,))
    transformed_feat_test = numpy.zeros((0,0,))

    transformed_feat_train = cv.fit_transform(trainingTextFeat).toarray()
    transformed_feat_test = cv.transform(testingTextFeat).toarray()

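This all works well for training and testing the classifier. However, I am not sure how to use fit_transform() and transform() together with the pickled version of the trained classifier to predict labels for unseen, unlabeled data.

I extract features from the unlabeled data in exactly the same way as I did when training and testing the classifier: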

    ## load the pickled classifier for labeling
    pickledClassifier = joblib.load(pickledClassifierFile)

    ## transform data
    cv = CountVectorizer(min_df=1, ngram_range=(1, 3), max_features = 15000)
    cv.fit_transform(NOT_SURE)

    transformed_feat_unlabeled = numpy.zeros((0,0,))
    transformed_feat_unlabeled = cv.transform(unlabeled_text_feat).toarray()

    ## predict label on unseen, unlabeled data
    l_predLabel = pickledClassifier.predict(transformed_feat_unlabeled)

Error message:

    Traceback (most recent call last):
      File "../clf.py", line 615, in <module>
        if __name__=="__main__": main()
      File "../clf.py", line 579, in main
        cv.fit_transform(pickledClassifierFile)
      File "../sklearn/feature_extraction/text.py", line 780, in fit_transform
        vocabulary, X = self._count_vocab(raw_documents, self.fixed_vocabulary)
      File "../sklearn/feature_extraction/text.py", line 727, in _count_vocab
        raise ValueError("empty vocabulary; perhaps the documents only"
    ValueError: empty vocabulary; perhaps the documents only contain stop words

1 Answer:

Answer 0 (score: 4)

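You should use the same vectorizer instance to transform both the training and the test data. You can do this by creating a pipeline from the vectorizer and the classifier, training the pipeline on the training set, and pickling the whole pipeline. Later, load the pickled pipeline and call predict on it.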
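A minimal sketch of that approach follows. The question does not say which classifier was used, so MultinomialNB is only a stand-in here; the toy texts, the labels, and the file name text_clf.joblib are likewise made up for illustration. Only the CountVectorizer settings are taken from the question.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy stand-ins for the training texts and labels (hypothetical data).
train_texts = [
    "great product works well",
    "really great and very good",
    "good value works great",
    "terrible product broke fast",
    "very bad and terrible",
    "bad value broke quickly",
]
train_labels = [1, 1, 1, 0, 0, 0]

# The vectorizer and the classifier live in one object, so the vocabulary
# learned during fit() is saved together with the model.
pipeline = Pipeline([
    ("cv", CountVectorizer(min_df=1, ngram_range=(1, 3), max_features=15000)),
    ("clf", MultinomialNB()),  # assumption: replace with your own classifier
])
pipeline.fit(train_texts, train_labels)

# Persist the whole fitted pipeline (hypothetical file name).
model_path = os.path.join(tempfile.mkdtemp(), "text_clf.joblib")
joblib.dump(pipeline, model_path)

# Later, possibly in another process: load it and predict on raw text.
# No new CountVectorizer and no call to fit_transform() are needed.
loaded = joblib.load(model_path)
unlabeled_texts = ["works great", "terrible and bad"]
predicted = loaded.predict(unlabeled_texts)
print(predicted)
```

Because the pipeline takes raw strings, the unlabeled documents are passed straight to predict; the fitted vectorizer inside the pipeline applies the training vocabulary, which also avoids the empty-vocabulary error from refitting on the wrong input.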

See this related question: Bringing a classifier to production