I trained an SVM with the following code:
GridSearchCV(svm.SVC(), parameters, cv=3, iid=False, n_jobs=-1, scoring="f1")
on data that I created like this:
def transform_concated_data(df):
    X_train, X_test, y_train, y_test = train_test_split(comments, df['yes_no'], train_size=.85, test_size=.15)
    # instantiate vect and transformers
    count_vect = CountVectorizer()
    tfidf_transformer = TfidfTransformer()
    # model train
    X_train_counts = count_vect.fit_transform(X_train)
    X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
    # model test
    X_test_counts = count_vect.transform(X_test)
    X_test_tfidf = tfidf_transformer.transform(X_test_counts)
    return X_train_tfidf, X_test_tfidf, y_train, y_test
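For context, the training step roughly looks like this (a sketch only: the values in parameters are placeholders, not my exact grid, and comments/df are assumed to already be loaded):

from sklearn import svm
from sklearn.model_selection import GridSearchCV

# placeholder grid, not my real search space
parameters = {'kernel': ['linear', 'rbf'], 'C': [1, 10]}

X_train_tfidf, X_test_tfidf, y_train, y_test = transform_concated_data(df)
gs_clf = GridSearchCV(svm.SVC(), parameters, cv=3, iid=False, n_jobs=-1, scoring="f1")
gs_clf = gs_clf.fit(X_train_tfidf, y_train)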
I save the model like this:
with open('/users/josh.flori/drive_backup/drive_backup/python_scripts/reddit_bot_stuff/svm_model.sav', 'wb') as f:
    cPickle.dump(gs_clf, f)
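and later load it back roughly like this (same path, same cPickle module):

# reload the pickled grid-search model
with open('/users/josh.flori/drive_backup/drive_backup/python_scripts/reddit_bot_stuff/svm_model.sav', 'rb') as f:
    gs_clf = cPickle.load(f)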
When I later load it and want to classify brand-new data, I currently use the following function to format the new data for the model:
def transform_non_concated_data(df):
    # real-world test
    count_vect = CountVectorizer()
    tfidf_transformer = TfidfTransformer()
    X_NEW_counts = count_vect.fit_transform(comments)
    X_NEW_tfidf = tfidf_transformer.fit_transform(X_NEW_counts)
    y_new = df['yes_no']
    return X_NEW_tfidf, y_new
But as you'll notice, in that second function I build the new tf-idf data with brand-new count_vect and tfidf_transformer instances rather than the ones I fit during training, so the data I feed into the model is transformed differently. Do I have to use the same count_vect and tfidf_transformer that I trained with?
Or will it work as-is, and somehow find the correct data when I call gs_clf.predict() on the new data?
EDIT: to answer my own question, at training time I need to build a vocabulary like this
training_vocab = dict((key, value) for (key, value) in
                      zip(count_vect.get_feature_names(), range(len(count_vect.get_feature_names()))))
and then, when loading the new data, instantiate a new count_vect like this
count_vect = CountVectorizer(vocabulary=training_vocab)
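Putting that together, classifying new data then looks roughly like this (a sketch of the approach above; comments here stands for whatever new text I want to classify):

# transform the new comments with the training vocabulary, then predict
count_vect = CountVectorizer(vocabulary=training_vocab)
tfidf_transformer = TfidfTransformer()
X_NEW_counts = count_vect.fit_transform(comments)
X_NEW_tfidf = tfidf_transformer.fit_transform(X_NEW_counts)
predictions = gs_clf.predict(X_NEW_tfidf)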