Question

I have been trying to build a spam classifier with sklearn's Naive Bayes, but I get the following output and error:

    Traceback (most recent call last):
      File "Spamclassifier.py", line 61, in <module>
        score=clf.score(test_data,test_label)
      File "C:\Users\abc\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\base.py", line 349, in score
        return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
      File "C:\Users\abc\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\naive_bayes.py", line 66, in predict
        jll = self._joint_log_likelihood(X)
      File "C:\Users\abc\AppData\Local\Programs\Python\Python37\lib\site-packages\sklearn\naive_bayes.py", line 433, in _joint_log_likelihood
        n_ij -= 0.5 * np.sum(((X - self.theta_[i, :]) ** 2) /
    ValueError: operands could not be broadcast together with shapes (780,12964) (19419,)

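I am attaching screenshots of my training data and test data directories, along with one of the messages. Please help me resolve this error. Here is my code: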

    import os
    import pickle
    from sklearn.naive_bayes import GaussianNB
    from sklearn.feature_extraction.text import TfidfVectorizer
    import numpy as np
    label=[]

    def getd(add):
       all_words=[]
       labell=[]
       email=[os.path.join(add, i) for i in os.listdir(add)]
       for mails in email:
           with open(mails) as m :
               for line in m:
                   all_words.append(line)
                   if 'spmsg' in mails:
                        labell.append(2)
                   else :
                       labell.append(1)
       return all_words, labell

    def check(add):
       all_words=[]
       labelt=[]
       email=[os.path.join(add, i) for i in os.listdir(add)]
       for mails in email:
           with open(mails) as m :
               for i, line in enumerate(m):
                   all_words.append(line)
                   if 'spmsg' in mails:
                       labelt.append(2)
                   else :
                       labelt.append(1)
       return all_words, labelt


    add=input("Enter the address of training directory\n")
    All, label=getd(add);

    vectorizer=TfidfVectorizer(stop_words='english', analyzer='word')
    train_data=vectorizer.fit_transform(All)
    train_data=train_data.toarray()
    clf=GaussianNB()
    clf.fit(train_data,label)

    chec=input("Enter the address of test directory\n")
    test, test_label=check(chec)
    test_vectorizer=TfidfVectorizer(stop_words='english', analyzer='word')
    test_data=test_vectorizer.fit_transform(test)
    test_data=test_data.toarray()
    score=clf.score(test_data,test_label)
    print("Accuracy is "+str(score*100)+"%\n")


    outfile=open('pickled_classfier', 'wb')
    pickle.dump(clf,outfile)
    outfile.close()

Here is a screenshot of my training data directory.

Here is a screenshot of my test data directory.

Here is a screenshot of one of the messages.

Answer 1

You are fitting a new test_vectorizer on the test data. That is the mistake.

When you call:

    train_data=vectorizer.fit_transform(All)

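the vectorizer learns the words present in the training data and stores them as its vocabulary. The shape of train_data reflects this. It turns out to be:

    (n_samples, 19419)

where 19419 is the number of unique words it learned. These become the features for GaussianNB.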
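As a small illustration of that relationship, here is a minimal sketch on a toy corpus. The lines in `train_lines` are hypothetical stand-ins for the real emails, and the default `TfidfVectorizer` settings are assumed rather than the ones from the question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy lines standing in for the real training emails.
train_lines = ["free money now", "meeting at noon", "win a free prize"]

vectorizer = TfidfVectorizer()
train_data = vectorizer.fit_transform(train_lines)

# The matrix has one row per input line and one column per learned word.
n_samples, n_features = train_data.shape
print(n_samples, n_features)

# vocabulary_ maps each learned word to its column index.
print(sorted(vectorizer.vocabulary_))
```

The number of columns always equals `len(vectorizer.vocabulary_)`, which is why the classifier ends up tied to the vocabulary seen during fitting.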

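Now, the test data will not contain all of those words, and you are fitting a new TfidfVectorizer on it. The new vectorizer (test_vectorizer) therefore finds a different set of words, which produces a different number of features:

    (780, 12964)

You then use the old clf on this test data, which raises the error because clf was trained on data with a different number of features.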
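The mismatch can be reproduced on a small scale. The following is a sketch under assumptions: the corpora and labels are made up for illustration (1 for ham, 2 for spam, as in the question), and the exact wording of the exception depends on the installed scikit-learn version, so only the exception type is relied on here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy data; 1 = ham, 2 = spam, following the question's labels.
train_lines = ["free money now", "meeting at noon",
               "win a free prize", "see you at lunch"]
train_labels = [2, 1, 2, 1]
test_lines = ["free lunch at noon", "call me now"]

vectorizer = TfidfVectorizer()
train_data = vectorizer.fit_transform(train_lines).toarray()
clf = GaussianNB().fit(train_data, train_labels)

# The mistake: a second vectorizer fitted on the test lines
# learns its own vocabulary, so the column count differs.
test_vectorizer = TfidfVectorizer()
wrong_test_data = test_vectorizer.fit_transform(test_lines).toarray()

print(train_data.shape, wrong_test_data.shape)

mismatch_error = None
try:
    clf.predict(wrong_test_data)
except ValueError as exc:
    # The classifier rejects input whose feature count differs from training.
    mismatch_error = exc
    print("ValueError:", exc)
```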

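To fix the error, just as you use the old clf to compute the score on the test data, you should also use the old vectorizer (the one fitted on the training data) and call:

    test_data=vectorizer.transform(test)

Note that I call transform() rather than fit_transform(), because calling fit() again would discard the vocabulary learned from the training data, which is not what we want.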
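Put together, the corrected flow might look like the sketch below. The toy corpora and labels are hypothetical (1 for ham, 2 for spam, as in the question); the point is only that a single fitted vectorizer is reused for both splits.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy data; 1 = ham, 2 = spam, following the question's labels.
train_lines = ["free money now", "meeting at noon",
               "win a free prize", "see you at lunch"]
train_labels = [2, 1, 2, 1]
test_lines = ["free lunch at noon", "call me now"]
test_labels = [1, 1]

# Fit the vectorizer once, on the training lines only.
vectorizer = TfidfVectorizer()
train_data = vectorizer.fit_transform(train_lines).toarray()

clf = GaussianNB()
clf.fit(train_data, train_labels)

# Reuse the same vectorizer: transform() keeps the training vocabulary,
# so the test matrix has the same columns as the training matrix.
test_data = vectorizer.transform(test_lines).toarray()

score = clf.score(test_data, test_labels)
print("Accuracy is {:.1f}%".format(score * 100))
```

Words in the test lines that never appeared during training are simply ignored by `transform()`, so the feature count stays fixed no matter what the test set contains.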
