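Defining a scikit-learn pipeline to account for unseen test data

Time: 2017-06-09 15:05:36

Tags: python arrays pandas numpy scikit-learn

I get the following error when I try to make predictions on unseen test data with a scikit-learn model loaded from a pkl file.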

---------------------------------------------------------------------------

/Users/anaconda/lib/python2.7/site-packages/sklearn/linear_model/base.pyc in decision_function(self, X)
    315         if X.shape[1] != n_features:
    316             raise ValueError("X has %d features per sample; expecting %d"
--> 317                              % (X.shape[1], n_features))
    318 
    319         scores = safe_sparse_dot(X, self.coef_.T,

ValueError: X has 6 features per sample; expecting 10000

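I have followed the instructions in other similar Stack Overflow posts (here, here, and here) and implemented a Pipeline to make sure I use the same parameters. However, I still get the same error as when I created a separate vectorizer for the test data instead of using a pipeline. How can I use my previously trained, saved, and reloaded model to transform my unseen test data into a matrix of the right shape? Is there a scikit-learn feature that I am not using here?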
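For reference, the feature-count mismatch described above can be reproduced in isolation. The following is only a minimal sketch, assuming the toy sentences from the sample data in this post and a default `CountVectorizer` (no `max_features`); the variable names are illustrative and not taken from the original code. It contrasts a second vectorizer fitted on the test texts with reusing the already fitted vectorizer through `transform()`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy texts copied from the sample data in this post.
train_texts = [
    "I love movies",
    "I hate movies",
    "That show was great",
    "He makes the plot interesting",
    "The ending made me sleep",
]
test_texts = [
    "I loved the plot of that show",
    "I dislike the main character",
]

# Fit the vocabulary once, on the training texts only.
train_vect = CountVectorizer()
X_train = train_vect.fit_transform(train_texts)

# A second vectorizer fitted on the test texts learns its own vocabulary,
# so its column count generally differs from the training matrix.
separate_vect = CountVectorizer()
X_test_separate = separate_vect.fit_transform(test_texts)

# Reusing the fitted vectorizer keeps the training columns;
# words never seen during training are simply ignored.
X_test_reused = train_vect.transform(test_texts)

print("train:", X_train.shape)
print("separate vectorizer:", X_test_separate.shape)
print("reused vectorizer:", X_test_reused.shape)
```

Only the matrix produced by the reused vectorizer has the same number of columns as the training matrix, which is what a classifier trained on `X_train` would expect.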

Here is a sample of my code:

import pandas as pd
import sys
from sklearn.externals import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# ingests train data

data = pd.read_csv('train.csv', header=0, sep=',', names=['sentiment', 'review'])
X_train = data["review"]
y_train = data["sentiment"]

# trains and saves models with pipeline


text_log_reg = Pipeline([('count_vect', CountVectorizer(analyzer = "word",
                             tokenizer = None,
                             preprocessor = None,
                             stop_words = None,
                             max_features = 10000)),
                    ('tfidf', TfidfTransformer(use_idf=True)),
                    ('log_reg', LogisticRegression()),
                    ])

text_log_reg.fit(X_train, y_train)
results2 = text_log_reg.predict(X_train)

_ = joblib.dump(text_log_reg, 'log_reg-SA-1,pkl', compress=9)

# uses loaded model to make predictions

dataTest = pd.read_csv('test.csv', 
                   header=None, sep=',', names=['review'])

X_test = dataTest["review"]

print('----load classifier model----')
clf = joblib.load('log_reg-SA-1,pkl')

predicted = clf.predict(X_test)
print(predicted)
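As a point of comparison, here is a minimal, self-contained sketch of the same fit, save, load, and predict round trip. It rests on several assumptions that are not in the code above: the standalone `joblib` package is used instead of `sklearn.externals.joblib`, the training texts and labels are the toy rows from the sample data, and the file name and temporary directory are hypothetical. The point it illustrates is that the loaded pipeline is given raw strings and applies its own fitted vectorizer.

```python
import os
import tempfile

import joblib  # assumption: standalone joblib, not sklearn.externals.joblib
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy data copied from the sample rows in this post.
train_texts = [
    "I love movies",
    "I hate movies",
    "That show was great",
    "He makes the plot interesting",
    "The ending made me sleep",
]
train_labels = [1, 0, 1, 1, 0]
unseen_texts = [
    "I loved the plot of that show",
    "I dislike the main character",
]

# Same three steps as the pipeline in the post.
pipeline = Pipeline([
    ("count_vect", CountVectorizer(max_features=10000)),
    ("tfidf", TfidfTransformer(use_idf=True)),
    ("log_reg", LogisticRegression()),
])
pipeline.fit(train_texts, train_labels)

# Hypothetical file name inside a temporary directory.
model_path = os.path.join(tempfile.mkdtemp(), "text_log_reg.pkl")
joblib.dump(pipeline, model_path, compress=9)

# The loaded object is the whole pipeline, so it takes raw strings
# and vectorizes them with the vocabulary learned during fit().
loaded = joblib.load(model_path)
predicted = loaded.predict(unseen_texts)

# The classifier expects exactly as many features as the fitted vocabulary has.
n_vocab = len(loaded.named_steps["count_vect"].vocabulary_)
n_coef = loaded.named_steps["log_reg"].coef_.shape[1]
print(predicted, n_vocab, n_coef)
```

In this sketch the vocabulary size and the classifier's coefficient count agree by construction, because both come from the same fitted pipeline. Note that `max_features=10000` is only an upper bound, so a small corpus yields far fewer than 10000 columns.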

Sample data: train.csv

"id","review","score"
"123","I love movies","1"
"456","I hate movies","0"
"789","That show was great","1"
"012","He makes the plot interesting","1"
"345","The ending made me sleep","0"

test.csv

"id","review"
"678","I loved the plot of that show"
"910","I dislike the main character"
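One detail worth checking: the sample files have the header columns `id`, `review`, `score` (train) and `id`, `review` (test), while the code passes `names=['sentiment', 'review']` and `names=['review']`. The sketch below assumes the sample headers are accurate and reads the columns by the files' own header names. The in-memory strings stand in for the CSV files, and the variable names are illustrative.

```python
import io

import pandas as pd

# In-memory copies of the sample files shown above.
train_csv = '''"id","review","score"
"123","I love movies","1"
"456","I hate movies","0"
"789","That show was great","1"
"012","He makes the plot interesting","1"
"345","The ending made me sleep","0"
'''
test_csv = '''"id","review"
"678","I loved the plot of that show"
"910","I dislike the main character"
'''

# Use the header row of each file; keep ids as strings so "012" is preserved.
train_df = pd.read_csv(io.StringIO(train_csv), dtype={"id": str})
test_df = pd.read_csv(io.StringIO(test_csv), dtype={"id": str})

# Select text and labels by column name.
X_train = train_df["review"]
y_train = train_df["score"]
X_test = test_df["review"]

print(X_train.tolist())
print(y_train.tolist())
print(X_test.tolist())
```

Printing the selected columns before fitting is a quick way to confirm that the text column, and not the label or id column, is what reaches the vectorizer.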

0 answers:

No answers