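If I use GridSearchCV with a pipeline to find the best parameters, is there any way to save the trained model so that in the future I can apply the whole pipeline to new data and generate predictions for it? For example, I have the following pipeline, followed by a GridSearchCV over its parameters: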
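from pprint import pprint
from time import time

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(SVC(probability=True))),
])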
parameters = {
    'vect__ngram_range': ((1, 1), (1, 2), (1, 3)),  # unigrams, up to bigrams, or up to trigrams
    'clf__estimator__kernel': ('rbf', 'linear'),
    'clf__estimator__C': tuple([10**i for i in range(-10, 11)]),
}
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)
print("Performing grid search...")
print("pipeline:", [name for name, _ in pipeline.steps])
print("parameters:")
pprint(parameters)
t0 = time()
#Conduct the grid search
grid_search.fit(X,y)
print("done in %0.3fs" % (time() - t0))
print()
print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
#Obtain the top performing parameters
best_parameters = grid_search.best_estimator_.get_params()
#Print the results
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
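Now I would like to save all of these steps as a single process, so that I can apply it to a new, unseen dataset and have it use the same parameters, vectorizer, and transformer to transform the data, run the model, and report the results.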
Answer 0 (score: 7)
You can simply pickle the GridSearchCV object to save it, then unpickle it when you want to use it to predict on new data.
import pickle

# Fit the model, then pickle the fitted GridSearchCV object
grid_search.fit(X, y)
with open('/model/path/model_pickle_file', "wb") as fp:  # pickle requires binary mode
    pickle.dump(grid_search, fp)

# Load the model from file
with open('/model/path/model_pickle_file', "rb") as fp:
    grid_search_load = pickle.load(fp)

# Predict new data with the model loaded from disk
y_new = grid_search_load.best_estimator_.predict(X_new)
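For anyone who wants to try the round trip end to end, here is a minimal sketch under a few assumptions of mine: the toy corpus, the labels, the reduced parameter grid, and the temporary file name are all made up for illustration, and I dropped probability=True because the data set is tiny. It pickles only best_estimator_ (the refitted pipeline) rather than the whole search object, which is one possible choice when the cross-validation results are not needed later.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy corpus and labels, invented for this sketch only.
docs = [
    "cheap pills buy now", "win money fast", "free prize claim now",
    "buy cheap watches", "claim your free money", "fast cash prize",
    "meeting at noon tomorrow", "project report attached", "lunch with the team",
    "see you at the meeting", "report due tomorrow", "team project update",
]
labels = ["spam"] * 6 + ["ham"] * 6

pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", OneVsRestClassifier(SVC())),
])

# A much smaller grid than in the question, to keep the sketch fast.
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "clf__estimator__C": [0.1, 1, 10],
}

search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(docs, labels)

# With the default refit=True, best_estimator_ is the whole pipeline
# refitted on all of the data with the best parameters.
best_pipeline = search.best_estimator_

# Hypothetical file location inside a temporary directory.
model_path = os.path.join(tempfile.mkdtemp(), "best_pipeline.pkl")

with open(model_path, "wb") as fp:
    pickle.dump(best_pipeline, fp)

with open(model_path, "rb") as fp:
    restored = pickle.load(fp)

# The restored pipeline vectorizes, transforms, and classifies raw text.
new_docs = ["free money now", "project meeting tomorrow"]
predictions = restored.predict(new_docs)

# The restored pipeline should agree with the in-memory one.
same = list(predictions) == list(best_pipeline.predict(new_docs))
print(predictions, same)
```

Because the vectorizer and the TF-IDF transformer are steps of the pickled pipeline, new data is passed in as raw text and goes through the same fitted vocabulary and weights. Note that a pickle should only be loaded from a source you trust, and it is generally expected to be loaded with the same scikit-learn version that produced it.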