SciPy and scikit-learn - ValueError: dimension mismatch

Time: 2012-09-18 20:16:33

Tags: python numpy scipy scikit-learn

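I use SciPy and scikit-learn to train and apply a Multinomial Naive Bayes classifier for binary text classification. Specifically, I use sklearn.feature_extraction.text.CountVectorizer to build sparse matrices of word feature counts from the texts, and sklearn.naive_bayes.MultinomialNB as the classifier implementation, which I train on the training data and then apply to the test data.

The input to CountVectorizer is a list of text documents represented as unicode strings. The training data is much larger than the test data. My code looks like this (simplified):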

vectorizer = CountVectorizer(**kwargs)

# sparse matrix with training data
X_train = vectorizer.fit_transform(list_of_documents_for_training)

# vector holding target values (=classes, either -1 or 1) for training documents
# this vector has the same number of elements as the list of documents
y_train = numpy.array([1, 1, 1, -1, -1, 1, -1, -1, 1, 1, -1, -1, -1, ...])

# sparse matrix with test data
X_test = vectorizer.fit_transform(list_of_documents_for_testing)

# Training stage of NB classifier
classifier = MultinomialNB()
classifier.fit(X=X_train, y=y_train)

# Prediction of log probabilities on test data
X_log_proba = classifier.predict_log_proba(X_test)

The problem: as soon as MultinomialNB.predict_log_proba() is called, I get ValueError: dimension mismatch. According to the IPython stack trace below, the error is raised inside SciPy:

/path/to/my/code.pyc
--> 177         X_log_proba = classifier.predict_log_proba(X_test)

/.../sklearn/naive_bayes.pyc in predict_log_proba(self, X)
    76             in the model, where classes are ordered arithmetically.
    77         """
--> 78         jll = self._joint_log_likelihood(X)
    79         # normalize by P(x) = P(f_1, ..., f_n)
    80         log_prob_x = logsumexp(jll, axis=1)

/.../sklearn/naive_bayes.pyc in _joint_log_likelihood(self, X)
    345         """Calculate the posterior log probability of the samples X"""
    346         X = atleast2d_or_csr(X)
--> 347         return (safe_sparse_dot(X, self.feature_log_prob_.T)
    348                + self.class_log_prior_)
    349 

/.../sklearn/utils/extmath.pyc in safe_sparse_dot(a, b, dense_output)
    71     from scipy import sparse
    72     if sparse.issparse(a) or sparse.issparse(b):
--> 73         ret = a * b
    74         if dense_output and hasattr(ret, "toarray"):
    75             ret = ret.toarray()

/.../scipy/sparse/base.pyc in __mul__(self, other)
    276 
    277             if other.shape[0] != self.shape[1]:
--> 278                 raise ValueError('dimension mismatch')
    279 
    280             result = self._mul_multivector(np.asarray(other))

I have no idea why this error occurs. Can anyone explain it to me and suggest a solution? Many thanks in advance!
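The snippet in the question calls fit_transform on both the training and the test documents. A minimal sketch of what that does to the matrix widths follows; the toy documents and variable names are assumptions made for illustration, not the asker's data. Each fit_transform call learns a new vocabulary, so the second matrix ends up with a different number of columns than the first.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents, assumed for illustration only.
train_docs = ["the cat sat on the mat", "the dog ate my homework"]
test_docs = ["a cat and a dog"]

vectorizer = CountVectorizer()

# First fit: the vocabulary is learned from the training documents.
X_train = vectorizer.fit_transform(train_docs)

# Second fit: the training vocabulary is discarded and a new one is
# learned from the test documents alone, as in the question's code.
X_test_refit = vectorizer.fit_transform(test_docs)

# The number of columns (features) now differs between the two matrices.
print(X_train.shape, X_test_refit.shape)
```

A classifier fitted on X_train expects as many columns as X_train has, which matches the `other.shape[0] != self.shape[1]` check at the bottom of the traceback.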

2 answers:

Answer 0 (score: 0)

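Another solution is to reuse the vocabulary learned during training (vector.vocabulary_):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# learn the vocabulary from the training data and train the classifier
vector = CountVectorizer()
vector.fit(x_data)
training_data = vector.transform(x_data)
bayes = MultinomialNB()
bayes.fit(training_data, y_data)

# build a new vectorizer with the learned vocabulary for prediction,
# so the columns line up with what the classifier was trained on
vector = CountVectorizer(vocabulary=vector.vocabulary_)
text_vector = vector.transform(text)
bayes.predict_proba(text_vector)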
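The vocabulary-reuse idea above can be sketched end to end as follows. The documents and labels are assumed toy data. The sketch also shows what is probably the simpler variant: calling transform (not fit_transform) on the already fitted vectorizer, which should produce the same test matrix as a second vectorizer built with the learned vocabulary.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy data, assumed for illustration only.
train_docs = ["good movie", "great film", "bad movie", "awful film"]
y_train = np.array([1, 1, -1, -1])
test_docs = ["good film", "awful plot"]

# Fit the vocabulary once, on the training documents.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# Variant 1: only transform the test documents with the fitted vectorizer.
X_test = vectorizer.transform(test_docs)

# Variant 2: a second vectorizer with the learned vocabulary fixed up front.
fixed = CountVectorizer(vocabulary=vectorizer.vocabulary_)
X_test_fixed = fixed.transform(test_docs)

classifier = MultinomialNB()
classifier.fit(X_train, y_train)

# Both test matrices have the same columns as X_train, so prediction works.
log_proba = classifier.predict_log_proba(X_test)
print(log_proba.shape)
```

Words that appear only in the test documents ("plot" here) are ignored by transform, since they have no column in the training vocabulary.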

Answer 1 (score: 0)

I ran into a similar situation. However, I did use tf_vec.transform(test) and still got the dimension mismatch error.

term_docs_train = tf_vec.fit_transform(X_train) 
term_docs_test = tf_vec.transform(X_test) 
clf = MultinomialNB(fit_prior=True) 
clf.fit(term_docs_train, Y_train)
prediction_prob = clf.predict_proba(term_docs_test)

Shape check:

term_docs_train.shape : (37, 298)
term_docs_test.shape : (19, 298)

Here is the error:

Traceback (most recent call last):
  File "C:\Users\yskwak\.conda\envs\tf2.0-gpu\lib\site-packages\IPython\core\interactiveshell.py", line 3331, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-3-6159f2dd98f0>", line 1, in <module>
    prediction_prob = clf.predict_proba(term_docs_test)
  File "C:\Users\yskwak\.conda\envs\tf2.0-gpu\lib\site-packages\sklearn\naive_bayes.py", line 118, in predict_proba
    return np.exp(self.predict_log_proba(X))
  File "C:\Users\yskwak\.conda\envs\tf2.0-gpu\lib\site-packages\sklearn\naive_bayes.py", line 98, in predict_log_proba
    jll = self._joint_log_likelihood(X)
  File "C:\Users\yskwak\.conda\envs\tf2.0-gpu\lib\site-packages\sklearn\naive_bayes.py", line 777, in _joint_log_likelihood
    return (safe_sparse_dot(X, self.feature_log_prob_.T) +
  File "C:\Users\yskwak\.conda\envs\tf2.0-gpu\lib\site-packages\sklearn\utils\validation.py", line 73, in inner_f
    return f(**kwargs)
  File "C:\Users\yskwak\.conda\envs\tf2.0-gpu\lib\site-packages\sklearn\utils\extmath.py", line 153, in safe_sparse_dot
    ret = a @ b
  File "C:\Users\yskwak\.conda\envs\tf2.0-gpu\lib\site-packages\scipy\sparse\base.py", line 564, in __matmul__
    return self.__mul__(other)
  File "C:\Users\yskwak\.conda\envs\tf2.0-gpu\lib\site-packages\scipy\sparse\base.py", line 520, in __mul__
    raise ValueError('dimension mismatch')
ValueError: dimension mismatch
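The shape check above compares only the two matrices with each other. The multiplication in the traceback, however, is between the test matrix and the classifier's own feature_log_prob_ attribute, so it may be worth comparing against that as well. Below is a minimal sketch; check_feature_count is a hypothetical helper, the data is assumed, and CountVectorizer stands in for whatever tf_vec is in this answer.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


def check_feature_count(clf, X):
    """Hypothetical helper: compare X's column count with the fitted classifier."""
    # feature_log_prob_ has shape (n_classes, n_features) after fitting.
    expected = clf.feature_log_prob_.shape[1]
    actual = X.shape[1]
    if expected != actual:
        raise ValueError(f"classifier expects {expected} features, got {actual}")
    return expected


# Toy data, assumed for illustration only.
train_docs = ["good movie", "great film", "bad movie", "awful film"]
y = [1, 1, -1, -1]

vec = CountVectorizer()
X_train = vec.fit_transform(train_docs)
clf = MultinomialNB().fit(X_train, y)

# Same fitted vectorizer: the column counts agree.
X_ok = vec.transform(["good film"])
n_features = check_feature_count(clf, X_ok)

# A separately fitted vectorizer: the column counts differ.
X_bad = CountVectorizer().fit_transform(["something else entirely"])
try:
    check_feature_count(clf, X_bad)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

If the classifier's feature count differs from the test matrix even though the train and test matrices agree, one possible explanation is that the classifier object in the session was fitted earlier on a different matrix.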