Handling a mismatch in sparse matrix dimensions

Asked: 2017-08-17 12:29:23

Tags: python python-2.7 machine-learning scikit-learn text-classification

I am working on classifying short pieces of text data.

I need to make predictions in the following format:

xyz@gmail.com -> email, India -> country, etc.

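To achieve this, I am using SVC wrapped in OneVsRestClassifier. If the train and test subsets are handled in the same script, prediction works fine.

However, prediction fails when the two steps are run separately (i.e. training and testing live in separate Python scripts).

The error I receive points to a mismatch in the dimensions of the sparse matrix.

Please help me resolve this dimension mismatch.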
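A mismatch like the one described above typically shows up when a second TfidfVectorizer is fitted on the test strings, because each fit learns its own vocabulary and therefore its own column count. The following is a minimal sketch of that effect, not the code from the question; `train_docs` and `test_docs` are hypothetical toy data chosen only for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data; the real training texts are not shown in the question.
train_docs = ['india', 'france', 'singapore',
              'xyz@gmail.com', 'abc@yahoo.com', 'germany']
test_docs = ['france', 'xyz@gmail.com', 'singapore']

# Vectorizer fitted on the training texts: its vocabulary defines the columns.
fitted = TfidfVectorizer(binary=True).fit(train_docs)

# A fresh vectorizer fitted on the test texts learns a different, smaller vocabulary.
refit = TfidfVectorizer(binary=True).fit(test_docs)

same_columns = fitted.transform(test_docs)   # uses the training vocabulary
other_columns = refit.transform(test_docs)   # uses the test-only vocabulary

print(same_columns.shape, other_columns.shape)
```

Only the matrix produced by the vectorizer that was fitted on the training texts has the column count that a model trained on those texts expects.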

Sample trainer module

import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

#Tfidf vectorizer for text data

tfidf_enc = TfidfVectorizer(binary=True)
lbl_enc = LabelEncoder()

X = tfidf_enc.fit_transform(name_text)
X = X.astype('float16')

y = lbl_enc.fit_transform(name_label)
clf = SVC(C=100, kernel='rbf', degree=3,
          gamma=1, coef0=1, shrinking=True, 
          probability=True, tol=0.001, cache_size=200,
          class_weight=None, verbose=2, max_iter=-1,
          decision_function_shape=None, random_state=None) 
model = OneVsRestClassifier(clf, n_jobs=4)
model.fit(X,y)

import pickle
# save the model to disk
filename = 'D:/authAff_model.sav'
with open(filename, 'wb') as f:
    pickle.dump(model, f)

Sample test module

import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
import pickle
# load the model from disk
filename = 'D:/authAff_model.sav'
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)

#Prediction
test_as_text = ['France','xyz@gmail.com','Singapore']
test_as_text = [item.lower() for item in test_as_text]

tfidf_enc = TfidfVectorizer(binary=True)

X_test = tfidf_enc.fit_transform(test_as_text)
X_test = X_test.astype('float16')
y_test = loaded_model.predict(X_test)

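Error message

When testing from a separate script, the following error occurs:

ValueError: X.shape[1] = 6 should be equal to 6104, the number of features at training time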

Original dimensions:

<3x6104 sparse matrix of type '<type 'numpy.float16'>'
    with 3 stored elements in Compressed Sparse Row format>
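One possible way to keep the test matrix at the training width is to persist the fitted vectorizer (and label encoder) together with the model, and to call only `transform` at prediction time. This is a sketch under stated assumptions: `train_text` and `train_label` are hypothetical stand-ins for the `name_text` and `name_label` that the question does not show, the dictionary keys are made up for illustration, the SVC settings are trimmed down, and the bundle is pickled to an in-memory byte string instead of a file so the example stays self-contained.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC

# Hypothetical toy training data (the real data is not shown in the question).
train_text = ['india', 'france', 'singapore', 'germany',
              'xyz@gmail.com', 'abc@yahoo.com',
              'john smith', 'jane doe']
train_label = ['country', 'country', 'country', 'country',
               'email', 'email',
               'name', 'name']

# --- training side ---
tfidf_enc = TfidfVectorizer(binary=True)
lbl_enc = LabelEncoder()
X = tfidf_enc.fit_transform(train_text)   # the vocabulary is learned here, once
y = lbl_enc.fit_transform(train_label)

model = OneVsRestClassifier(SVC(C=100, kernel='rbf', gamma=1))
model.fit(X, y)

# Persist the fitted vectorizer and label encoder alongside the model.
# The key names are arbitrary and chosen only for this sketch.
payload = pickle.dumps({'tfidf': tfidf_enc, 'labels': lbl_enc, 'model': model})

# --- prediction side (this part would live in the separate script) ---
bundle = pickle.loads(payload)
test_as_text = [item.lower() for item in ['France', 'xyz@gmail.com', 'Singapore']]

# transform only: reuse the training vocabulary instead of fitting a new one
X_test = bundle['tfidf'].transform(test_as_text)

y_pred = bundle['model'].predict(X_test)
predicted = list(bundle['labels'].inverse_transform(y_pred))
print(X_test.shape, predicted)
```

In a two-script setup the same bundle could be written with `pickle.dump` in the trainer and read with `pickle.load` in the tester. Wrapping the vectorizer and classifier in a `sklearn.pipeline.Pipeline` and pickling that single object is another option with the same effect.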

0 answers:

No answers yet.