I'm writing a spam classifier to learn sklearn, but I'm having trouble with my bag of words. I'm getting this error:
  File ".\classify.py", line 28, in <module>
    model.fit(X,y)
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\lib\site-packages\sklearn\naive_bayes.py", line 579, in fit
    X, y = check_X_y(X, y, 'csr')
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\lib\site-packages\sklearn\utils\validation.py", line 573, in check_X_y
    ensure_min_features, warn_on_dtype, estimator)
  File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\lib\site-packages\sklearn\utils\validation.py", line 433, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
TypeError: float() argument must be a string or a number, not 'csr_matrix'
Source code:
import numpy as np
from sklearn.externals import joblib
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
spam_emails = joblib.load("spam_emails.pkl")
ham_emails = joblib.load("ham_emails.pkl")
def transform(array):
    vectorizer = CountVectorizer()
    vectorized = vectorizer.fit_transform(array)
    transformer = TfidfTransformer()
    transformed = transformer.fit_transform(vectorized)
    return transformed
spam_emails = np.asarray(transform(spam_emails))
ham_emails = np.asarray(transform(ham_emails))
X = np.array(spam_emails, ham_emails) # the x values
y = np.array([1,0]) # The 2 labels to try to predict; 1 means spam, 0 means not spam
model = MultinomialNB()
model.fit(X,y)
joblib.dump(model, "mnb_model.pkl")
The spam_emails.pkl and ham_emails.pkl files each just contain all of the emails in a single array.
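
For comparison, here is a minimal sketch of a setup that should avoid this TypeError. It is a guess at the intent, not a confirmed fix. It rests on two points: np.asarray on a scipy sparse matrix wraps it in an object array rather than converting it to numbers, and MultinomialNB accepts sparse input directly. The sketch fits one shared vectorizer on all emails so both classes use the same feature columns, and builds one label per email instead of two labels in total. The short email lists are hypothetical stand-ins for the pickled arrays, and TfidfVectorizer is swapped in for the CountVectorizer plus TfidfTransformer pair.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-ins for the arrays stored in spam_emails.pkl and ham_emails.pkl.
spam_emails = ["win free money now", "free prize claim now"]
ham_emails = ["meeting at noon tomorrow", "see you at lunch tomorrow"]

# One row per email, one label per row: 1 means spam, 0 means not spam.
texts = list(spam_emails) + list(ham_emails)
y = np.array([1] * len(spam_emails) + [0] * len(ham_emails))

# A single vectorizer fitted on all emails gives both classes the same vocabulary.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)  # stays a scipy sparse matrix; no np.asarray

model = MultinomialNB()
model.fit(X, y)

# New text must go through the same fitted vectorizer (transform, not fit_transform).
prediction = model.predict(vectorizer.transform(["claim your free money"]))
print(prediction)
```

To classify new emails later, the fitted vectorizer would need to be saved alongside the model, since the model only understands the feature columns that vectorizer produced.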