Saving a TF-IDF vectorizer and reusing it on future examples leads to a dimension mismatch error

Asked: 2014-05-27 21:14:04

Tags: python scikit-learn

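I am training scikit-learn's Multinomial Naive Bayes classifier. I can already save the classifier using from sklearn.externals import joblib.

I now want to write a script that classifies new examples. My only problem is taking new data, which arrives as a string, and passing it to classifier.predict( ... ), which requires the data in vectorized form.

Previously I created the vectorizer as follows: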

vectorizer = TfidfVectorizer(min_df=2, ngram_range=(1, 2),  stop_words='english', strip_accents='unicode', norm='l2',decode_error="ignore")

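TF-IDF vectorization works across many documents. If I create a new vectorizer, I cannot just pass it a single document and then classify the result. I clearly need to save the vectorizer I trained with.

The real question is: how do I transform new data into the same form I trained the classifier on?

Am I correct to use vectorizer.transform(X_test_title)?

Edit:

It seems I was right above. However, now that I load both the classifier and the vectorizer into my script, I get a problem when passing the vectorized data to the classifier. Here is my function, where the title and the document are both clean strings: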

def predict_function(title_data, document_data):
    data = ((title_data + ' ') * number_repeat_title(title_data, document_data)) + document_data
    # requires a list
    data = [data, 'testing another element works']
    print data
    data_vector = vectorizer.transform(data)
    print data_vector # checking data is good!
    predicted = classifier.predict(data_vector) 
    return predicted

An example call to this function:

predict_function('mr sponge bob square pants', "SpongeBob SquarePants is an American animated television series created by marine biologist and animator Stephen Hillenburg for Nickelodeon. The series chronicles the adventures and endeavors of the title character and his various friends in the fictional underwater city of Bikini Bottom. The series' popularity has made it a media franchise, as well as Nickelodeon network's highest rated show, and the most distributed property of MTV Networks. The media franchise has generated $8 billion in merchandising revenue for Nickelodeon.")

I get an error at the prediction step:

predicted = classifier.predict(data_vector) 

...which gives:

/Library/Python/2.7/site-packages/scikit_learn-0.15_git-py2.7-macosx-10.9-intel.egg/sklearn/naive_bayes.pyc in predict(self, X)
     61             Predicted target values for X
     62         """
---> 63         jll = self._joint_log_likelihood(X)
     64         return self.classes_[np.argmax(jll, axis=1)]
     65 

/Library/Python/2.7/site-packages/scikit_learn-0.15_git-py2.7-macosx-10.9-intel.egg/sklearn/naive_bayes.pyc in _joint_log_likelihood(self, X)
    455         """Calculate the posterior log probability of the samples X"""
    456         X = atleast2d_or_csr(X)
--> 457         return (safe_sparse_dot(X, self.feature_log_prob_.T)
    458                 + self.class_log_prior_)
    459 

/Library/Python/2.7/site-packages/scikit_learn-0.15_git-py2.7-macosx-10.9-intel.egg/sklearn/utils/extmath.pyc in safe_sparse_dot(a, b, dense_output)
    189     from scipy import sparse
    190     if sparse.issparse(a) or sparse.issparse(b):
--> 191         ret = a * b
    192         if dense_output and hasattr(ret, "toarray"):
    193             ret = ret.toarray()

/Library/Python/2.7/site-packages/scipy-0.14.0.dev_572aaf0-py2.7-macosx-10.9-intel.egg/scipy/sparse/base.pyc in __mul__(self, other)
    337 
    338             if other.shape[0] != self.shape[1]:
--> 339                 raise ValueError('dimension mismatch')
    340 
    341             result = self._mul_multivector(np.asarray(other))

ValueError: dimension mismatch

1 Answer:

Answer 0: (score: 2)

Looking at the scikit-learn documentation found here (http://scikit-learn.org/stable/auto_examples/document_classification_20newsgroups.html), I believe you are right.

In the scikit-learn example, the training data is vectorized as follows:

vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                             stop_words='english')
X_train = vectorizer.fit_transform(data_train.data)

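This means the vectorizer now remembers the TFxIDF weights (the vocabulary and the IDF values learned from the training data).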
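As a rough illustration of what "remembers" means here, the sketch below fits a TfidfVectorizer on a small made-up corpus and inspects its fitted state. The corpus and variable names are hypothetical; vocabulary_ and idf_ are the attributes scikit-learn sets on a fitted TfidfVectorizer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only for illustration.
train_docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are animals",
]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# After fitting, the vectorizer holds the term -> column index mapping
# and one IDF weight per term. These define the feature space.
n_terms = len(vectorizer.vocabulary_)
print(n_terms, X_train.shape)
print(vectorizer.idf_.shape)
```

The number of columns in X_train equals the size of the learned vocabulary, so any matrix passed to the classifier later needs that same width.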

These weights are then applied to the test data with the following line of code:

X_test = vectorizer.transform(data_test.data)
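For the prediction script in the question, one possible approach is to persist the fitted vectorizer alongside the classifier and call only transform on new text. The sketch below is a minimal example under some assumptions: it uses a recent scikit-learn where joblib is imported as a standalone package (the sklearn.externals.joblib import used in the question was removed in later releases), and the toy data, labels, and file names are hypothetical. It also shows that a vectorizer fitted on different data produces a different number of columns, which is one likely cause of a "dimension mismatch" error.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training data, used only for illustration.
train_docs = [
    "spongebob lives in a pineapple under the sea",
    "the series follows spongebob and his friends",
    "the stock market fell sharply today",
    "investors sold shares as the market dropped",
]
train_labels = ["tv", "tv", "finance", "finance"]

# Training script: fit the vectorizer and the classifier together.
vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_docs)
classifier = MultinomialNB().fit(X_train, train_labels)

# Persist BOTH fitted objects (file names are assumptions).
model_dir = tempfile.mkdtemp()
vectorizer_path = os.path.join(model_dir, "vectorizer.joblib")
classifier_path = os.path.join(model_dir, "classifier.joblib")
joblib.dump(vectorizer, vectorizer_path)
joblib.dump(classifier, classifier_path)

# Prediction script: load both and only transform, never refit.
loaded_vectorizer = joblib.load(vectorizer_path)
loaded_classifier = joblib.load(classifier_path)

new_docs = ["spongebob and his friends under the sea"]
X_new = loaded_vectorizer.transform(new_docs)

# The classifier expects as many columns as it saw during training.
expected_width = loaded_classifier.feature_log_prob_.shape[1]
predicted = loaded_classifier.predict(X_new)

# A vectorizer fitted on other data yields a different width; passing
# its output to the classifier would not match the trained feature space.
refit = TfidfVectorizer(stop_words="english").fit(new_docs)
X_wrong = refit.transform(new_docs)
print(X_new.shape[1], expected_width, X_wrong.shape[1])
print(predicted)
```

If the saved vectorizer and the saved classifier come from different training runs, comparing data_vector.shape[1] with classifier.feature_log_prob_.shape[1] before calling predict should reveal the mismatch.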