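I am working with text data.
I need to make predictions in the following format:
xyz@gmail.com -> email, India -> country, and so on.
To do this I am using SVC and OneVsRestClassifier. Prediction works fine when the training and test steps are in the same script.
However, it fails when they are run separately (that is, when the training and test code live in separate Python scripts).
The error I get points to a mismatch in the sparse matrix dimensions.
Please help me resolve this mismatch.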
# Training script
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
#Tfidf vectorizer for text data
tfidf_enc = TfidfVectorizer(binary=True)
lbl_enc = LabelEncoder()
X = tfidf_enc.fit_transform(name_text)
X = X.astype('float16')
y = lbl_enc.fit_transform(name_label)
clf = SVC(C=100, kernel='rbf', degree=3,
          gamma=1, coef0=1, shrinking=True,
          probability=True, tol=0.001, cache_size=200,
          class_weight=None, verbose=2, max_iter=-1,
          decision_function_shape=None, random_state=None)
model = OneVsRestClassifier(clf, n_jobs=4)
model.fit(X,y)
import pickle
# save the model to disk
filename = 'D:/authAff_model.sav'
pickle.dump(model, open(filename, 'wb'))
# Test script (run separately)
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
import pickle
# load the model from disk
filename = 'D:/authAff_model.sav'
loaded_model = pickle.load(open(filename, 'rb'))
#Prediction
test_as_text = ['France','xyz@gmail.com','Singapore']
test_as_text = [item.lower() for item in test_as_text]
tfidf_enc = TfidfVectorizer(binary=True)
X_test = tfidf_enc.fit_transform(test_as_text)
X_test = X_test.astype('float16')
y_test = loaded_model.predict(X_test)
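When run as a separate script, the test fails with the following error:
ValueError: X.shape[1] = 6 should be equal to 6104, the number of features at training time
Original dimensions: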
<3x6104 sparse matrix of type '<type 'numpy.float16'>'
with 3 stored elements in Compressed Sparse Row format>
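
The likely cause is that the test script builds a new TfidfVectorizer and calls fit_transform on the three test strings, so the vocabulary (and the column count) no longer matches the 6104 features seen during training. Below is a minimal sketch of one way around this, assuming the fitted vectorizer and label encoder can be pickled alongside the model and that the test side calls transform instead of fit_transform. The toy data, the linear kernel, the dictionary keys and the in-memory pickle.dumps (standing in for the .sav file) are illustrative assumptions, not part of the original scripts, and the float16 cast is left out for brevity.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

# Hypothetical toy stand-ins for name_text / name_label.
name_text = ['xyz@gmail.com', 'abc@yahoo.com', 'india',
             'france', 'singapore', 'germany']
name_label = ['email', 'email', 'country',
              'country', 'country', 'country']

# --- Training side ---
tfidf_enc = TfidfVectorizer(binary=True)
lbl_enc = LabelEncoder()
X = tfidf_enc.fit_transform(name_text)   # learns the vocabulary
y = lbl_enc.fit_transform(name_label)
model = OneVsRestClassifier(SVC(kernel='linear'))  # simplified settings
model.fit(X, y)

# Persist the fitted vectorizer and label encoder together with the model.
# pickle.dumps stands in for pickle.dump(..., open(filename, 'wb')).
blob = pickle.dumps({'tfidf': tfidf_enc, 'labels': lbl_enc, 'model': model})

# --- Prediction side (would live in the separate script) ---
bundle = pickle.loads(blob)
test_as_text = ['France', 'xyz@gmail.com', 'Singapore']
# transform() reuses the training vocabulary; fit_transform() would rebuild it.
X_test = bundle['tfidf'].transform(test_as_text)
pred = bundle['labels'].inverse_transform(bundle['model'].predict(X_test))
print(X_test.shape, list(pred))
```

With this arrangement X_test has the same number of columns as the training matrix, so predict no longer raises the shape error. Wrapping the vectorizer and classifier in a single sklearn Pipeline and pickling that one object would be another option.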