Question

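I am working on structuring text data.

I need to make predictions in the following format:

xyz@gmail.com -> email, India -> country, etc.

To achieve this, I am using SVC and OneVsRestClassifier. Prediction works fine when the train and test subsets are in the same script.

However, prediction fails when the two are run separately (i.e., when the training and test code are in separate Python scripts).

The error I receive points to a mismatch in the sparse matrix dimensions.

Please help me resolve this dimension mismatch.

Sample trainer module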

import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

#Tfidf vectorizer for text data

tfidf_enc = TfidfVectorizer(binary=True)
lbl_enc = LabelEncoder()

X = tfidf_enc.fit_transform(name_text)
X = X.astype('float16')

y = lbl_enc.fit_transform(name_label)
clf = SVC(C=100, kernel='rbf', degree=3,
          gamma=1, coef0=1, shrinking=True, 
          probability=True, tol=0.001, cache_size=200,
          class_weight=None, verbose=2, max_iter=-1,
          decision_function_shape=None, random_state=None) 
model = OneVsRestClassifier(clf, n_jobs=4)
model.fit(X,y)

import pickle
# save the model to disk (the with-block closes the file handle)
filename = 'D:/authAff_model.sav'
with open(filename, 'wb') as f:
    pickle.dump(model, f)

Sample test module

import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
import pickle
# load the model from disk (the with-block closes the file handle)
filename = 'D:/authAff_model.sav'
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)

#Prediction
test_as_text = ['France','xyz@gmail.com','Singapore']
test_as_text = [item.lower() for item in test_as_text]

tfidf_enc = TfidfVectorizer(binary=True)

X_test = tfidf_enc.fit_transform(test_as_text)
X_test = X_test.astype('float16')
y_test = loaded_model.predict(X_test)
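
A likely cause, sketched under assumptions: the test module creates a new TfidfVectorizer and calls fit_transform on the three test strings, so it learns a fresh vocabulary instead of reusing the one learned during training. The snippet below uses hypothetical toy strings (train_text and test_text are stand-ins, since the original name_text is not shown) to compare the column counts produced by a refitted vectorizer and by the training-time vectorizer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-ins for the real data, which is not shown in the question.
train_text = ["xyz@gmail.com", "india", "abc@yahoo.com", "france", "singapore"]
test_text = ["france", "xyz@gmail.com", "singapore"]

# Vectorizer fitted on the training text: its vocabulary fixes the column count.
train_vec = TfidfVectorizer(binary=True)
X_train = train_vec.fit_transform(train_text)

# Refitting a new vectorizer on the test text learns a different vocabulary,
# so the resulting matrix has a different number of columns.
refit_vec = TfidfVectorizer(binary=True)
X_refit = refit_vec.fit_transform(test_text)

# Calling transform() on the already fitted vectorizer keeps the training columns.
X_reused = train_vec.transform(test_text)

print("train:", X_train.shape)
print("refit on test:", X_refit.shape)
print("reused vectorizer:", X_reused.shape)
```

Only the last matrix has the same width as the training matrix, which is what a model fitted on X_train expects.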

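Error message

When testing as a separate script, the following error occurs:

ValueError: X.shape[1] = 6 should be equal to 6104, the number of features at training time

Original dimensions: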

<3x6104 sparse matrix of type '<type 'numpy.float16'>'
    with 3 stored elements in Compressed Sparse Row format>
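
One possible way around the mismatch, as a minimal sketch rather than a definitive fix: persist the fitted TfidfVectorizer and LabelEncoder together with the model, then call transform (not fit_transform) in the test script. The toy data, the dictionary keys, and the in-memory pickle.dumps/pickle.loads round trip are assumptions made for illustration; in two separate scripts the bundle would be written to and read from a file instead. The float16 cast and n_jobs from the original are left out to keep the sketch small.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

# Hypothetical toy data standing in for name_text / name_label.
name_text = ["xyz@gmail.com", "abc@yahoo.com", "india", "france",
             "singapore", "www.example.org", "www.sample.net"]
name_label = ["email", "email", "country", "country",
              "country", "url", "url"]

# --- Trainer side: fit the encoders and the model once ---
tfidf_enc = TfidfVectorizer(binary=True)
lbl_enc = LabelEncoder()
X = tfidf_enc.fit_transform(name_text)
y = lbl_enc.fit_transform(name_label)

model = OneVsRestClassifier(SVC(C=100, kernel="rbf", gamma=1))
model.fit(X, y)

# Bundle everything the test side needs (key names are illustrative).
bundle = pickle.dumps({"vectorizer": tfidf_enc,
                       "label_encoder": lbl_enc,
                       "model": model})

# --- Test side: load the bundle and only transform, never refit ---
loaded = pickle.loads(bundle)
test_as_text = [item.lower() for item in ["France", "xyz@gmail.com", "Singapore"]]

X_test = loaded["vectorizer"].transform(test_as_text)  # same columns as X
y_pred = loaded["model"].predict(X_test)

# Map the integer predictions back to the original label strings.
labels = loaded["label_encoder"].inverse_transform(y_pred)
print(X_test.shape, list(labels))
```

An alternative with the same effect would be to wrap the vectorizer and classifier in a single sklearn.pipeline.Pipeline and pickle that one object, so the two cannot get out of sync.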


0 answers: