I am training a machine learning model with sklearn to perform sentiment analysis on Persian text. Here is my code:
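import _pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

# data, data_labels and stop_words are defined elsewhere
vectorizer = TfidfVectorizer(max_features=1500,
                             sublinear_tf=True,
                             use_idf=True,
                             stop_words=stop_words)
X = vectorizer.fit_transform(data).toarray()
le = LabelEncoder()
le.fit(["pos", "neu", "neg"])
y = le.transform(data_labels)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
classifier_rbf = SVC(kernel='rbf', gamma=1, C=1)
classifier_rbf.fit(X_train, y_train)
y_pred = classifier_rbf.predict(X_test)
# Save the fitted classifier and vectorizer for later use
with open('svm_rbf_classifier.pkl', 'wb') as fid:
    _pickle.dump(classifier_rbf, fid)
with open('tfidf_vectorizer.pkl', 'wb') as fid:
    _pickle.dump(vectorizer, fid)
print(classification_report(y_test, y_pred))
print()
print(accuracy_score(y_test, y_pred))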
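After the training and testing phase, I just want to load my vectorizer and classifier and predict Persian comments one at a time. I wrote this code to do that: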
import _pickle

with open('tfidf_vectorizer.pkl', 'rb') as fid:
    vectorizer = _pickle.load(fid)
with open('svm_rbf_classifier.pkl', 'rb') as fid:
    classifier_rbf = _pickle.load(fid)
comment = 'من نسبت به نتایجی که تیم این روزا کسب میکنه نگرانم'
X = vectorizer.fit_transform([comment]).toarray()
predicted = classifier_rbf.predict(X)
print(predicted)
But when I run it, I get the following error:
Traceback (most recent call last):
  File "C:/Projects/Sentiment/test.py", line 18, in <module>
    predicted = classifier_rbf.predict(X)
  File "C:\Python\Python36\lib\site-packages\sklearn\svm\base.py", line 576, in predict
    y = super(BaseSVC, self).predict(X)
  File "C:\Python\Python36\lib\site-packages\sklearn\svm\base.py", line 325, in predict
    X = self._validate_for_predict(X)
  File "C:\Python\Python36\lib\site-packages\sklearn\svm\base.py", line 478, in _validate_for_predict
    (n_features, self.shape_fit_[1]))
ValueError: X.shape[1] = 8 should be equal to 1500, the number of features at training time
I don't understand this, because I am using the same vectorizer as in training and testing. What am I doing wrong?
Answer 0 (score: 1)
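You should not call fit_transform on your comment data; you should only transform it. fit_transform refits the loaded vectorizer on the single comment, so it produces the 8 features found in that comment instead of the 1500 features the classifier was trained on. Change
X = vectorizer.fit_transform([comment]).toarray()
to
X = vectorizer.transform([comment]).toarray()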
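To see why refitting changes the number of columns, here is a minimal sketch using a toy bag-of-words vectorizer. ToyVectorizer, train_docs and the English sample texts are hypothetical stand-ins made up for illustration; this is not sklearn's implementation, it only mimics the fit / transform / fit_transform split that TfidfVectorizer exposes.

```python
class ToyVectorizer:
    """Hypothetical, minimal bag-of-words vectorizer (not sklearn's code)."""

    def fit(self, docs):
        # Learn the vocabulary: one column per distinct token seen while fitting.
        tokens = sorted({tok for doc in docs for tok in doc.split()})
        self.vocabulary_ = {tok: i for i, tok in enumerate(tokens)}
        return self

    def transform(self, docs):
        # Map docs onto the already learned columns; unknown tokens are ignored.
        rows = []
        for doc in docs:
            row = [0] * len(self.vocabulary_)
            for tok in doc.split():
                idx = self.vocabulary_.get(tok)
                if idx is not None:
                    row[idx] += 1
            rows.append(row)
        return rows

    def fit_transform(self, docs):
        # Relearns the vocabulary from docs, then transforms them.
        return self.fit(docs).transform(docs)


train_docs = ["good film", "bad film", "good plot bad acting"]
vec = ToyVectorizer()
X_train = vec.fit_transform(train_docs)
n_train_features = len(X_train[0])  # columns a classifier would be trained on

comment = ["good acting"]
X_ok = vec.transform(comment)  # reuses the training vocabulary
n_ok_features = len(X_ok[0])

# A fresh instance is used here so that vec keeps its training vocabulary;
# calling vec.fit_transform(comment) would overwrite it in the same way.
X_bad = ToyVectorizer().fit_transform(comment)
n_bad_features = len(X_bad[0])

print(n_train_features, n_ok_features, n_bad_features)  # prints: 5 5 2
```

With transform, the comment is projected onto the columns learned during training, so the shape matches what the classifier expects. With fit_transform, the vocabulary is rebuilt from the comment alone, which is the same kind of mismatch as the 8 versus 1500 in your error.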