I get the following error when I try to predict on unseen test data with a scikit-learn model loaded from a pkl file.
---------------------------------------------------------------------------
/Users/anaconda/lib/python2.7/site-packages/sklearn/linear_model/base.pyc in decision_function(self, X)
315 if X.shape[1] != n_features:
316 raise ValueError("X has %d features per sample; expecting %d"
--> 317 % (X.shape[1], n_features))
318
319 scores = safe_sparse_dot(X, self.coef_.T,
ValueError: X has 6 features per sample; expecting 10000
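The check that raises this error compares the number of columns in X with the number of coefficient columns the classifier learned during training. Below is a minimal sketch of inspecting both numbers on a fitted pipeline, assuming Python 3 and a recent scikit-learn. The toy sentences and variable names are hypothetical stand-ins, not the real data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the real reviews.
train_texts = ["I love movies", "I hate movies",
               "That show was great", "The ending made me sleep"]
train_labels = [1, 0, 1, 0]

pipe = Pipeline([
    ("count_vect", CountVectorizer(max_features=10000)),
    ("tfidf", TfidfTransformer(use_idf=True)),
    ("log_reg", LogisticRegression()),
])
pipe.fit(train_texts, train_labels)

# Width the classifier expects: one coefficient column per training feature.
n_expected = pipe.named_steps["log_reg"].coef_.shape[1]

# Width the fitted vectorizer produces for previously unseen text.
vectorizer = pipe.named_steps["count_vect"]
n_produced = vectorizer.transform(["I loved the plot of that show"]).shape[1]

print(n_expected, n_produced, len(vectorizer.vocabulary_))
```

If the two numbers differ for a loaded model, the matrix reaching the classifier was probably not produced by the vectorizer that was fitted together with it.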
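I have followed the advice in other similar Stack Overflow posts, such as here, here, and here, and I now use a Pipeline to make sure the same parameters are applied at training and prediction time. However, I still get the same error that I got when I created a separate vectorizer for the test data instead of using the pipeline.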
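For illustration, here is a minimal sketch of how a separately fitted vectorizer can produce this kind of feature-count mismatch. It assumes Python 3 and a recent scikit-learn. The toy data and names such as `train_vect` and `test_vect` are hypothetical, and this is not necessarily the exact code that was tried.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data standing in for the real reviews.
train_texts = ["I love movies", "I hate movies",
               "That show was great", "The ending made me sleep"]
train_labels = [1, 0, 1, 0]
test_texts = ["I loved the plot of that show", "I dislike the main character"]

# Vectorizer and classifier fitted on the training data.
train_vect = CountVectorizer()
clf = LogisticRegression().fit(train_vect.fit_transform(train_texts), train_labels)

# A second vectorizer fitted on the test data learns its own vocabulary,
# so its output width generally differs from what the classifier expects.
test_vect = CountVectorizer()
X_test_wrong = test_vect.fit_transform(test_texts)

try:
    clf.predict(X_test_wrong)
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc
print(mismatch_error)

# Reusing the training vectorizer with transform() keeps the width consistent.
X_test_ok = train_vect.transform(test_texts)
predicted = clf.predict(X_test_ok)
print(predicted)
```

The point of the sketch is only that `fit_transform` on new data builds a new vocabulary, while `transform` on the already fitted vectorizer reuses the training vocabulary.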
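How can I use my previously trained and saved model, once loaded, to transform my unseen test data into the appropriate matrix? Is there a scikit-learn feature that I am not using here?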
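To make the intended round trip concrete, here is a minimal sketch of saving a fitted Pipeline, loading it again, and calling `predict` directly on raw strings. It assumes Python 3 and the standalone `joblib` package, since newer scikit-learn releases no longer ship `sklearn.externals.joblib`. The toy data and the file name `toy_pipeline.pkl` are hypothetical.

```python
import os
import tempfile

import joblib  # assumption: standalone joblib rather than sklearn.externals.joblib
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the real reviews.
train_texts = ["I love movies", "I hate movies",
               "That show was great", "The ending made me sleep"]
train_labels = [1, 0, 1, 0]
test_texts = ["I loved the plot of that show", "I dislike the main character"]

pipe = Pipeline([
    ("count_vect", CountVectorizer(max_features=10000)),
    ("tfidf", TfidfTransformer(use_idf=True)),
    ("log_reg", LogisticRegression()),
]).fit(train_texts, train_labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_pipeline.pkl")  # hypothetical file name
    joblib.dump(pipe, model_path, compress=9)
    loaded = joblib.load(model_path)

# The loaded pipeline vectorizes the raw strings itself before classifying them.
predicted = loaded.predict(test_texts)
print(predicted)
```

In this sketch the loaded object receives a plain sequence of strings, and no vectorizer is created or fitted at prediction time.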
Here is a sample of my code:
import pandas as pd
import sys
from sklearn.externals import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
# ingests train data
data = pd.read_csv('train.csv', header=0, sep=',', names=['sentiment', 'review'])
X_train = data["review"]
y_train = data["sentiment"]
# trains and saves models with pipeline
text_log_reg = Pipeline([('count_vect', CountVectorizer(analyzer="word",
                                                        tokenizer=None,
                                                        preprocessor=None,
                                                        stop_words=None,
                                                        max_features=10000)),
                         ('tfidf', TfidfTransformer(use_idf=True)),
                         ('log_reg', LogisticRegression()),
                         ])
text_log_reg.fit(X_train, y_train)
results2 = text_log_reg.predict(X_train)
_ = joblib.dump(text_log_reg, 'log_reg-SA-1,pkl', compress=9)
# uses loaded model to make predictions
dataTest = pd.read_csv('test.csv',
                       header=None, sep=',', names=['review'])
X_test = dataTest["review"]
print '----load classifier model----'
clf = joblib.load('log_reg-SA-1,pkl')
predicted = clf.predict(X_test)
print predicted
Sample data:
train.csv
"id","review","score"
"123","I love movies","1"
"456","I hate movies","0"
"789","That show was great","1"
"012","He makes the plot interesting","1"
"345","The ending made me sleep","0"
test.csv
"id","review"
"678","I loved the plot of that show"
"910","I dislike the main character"