How to use a trained SVM model on new TF-IDF data

Time: 2019-04-08 15:09:04

Tags: scikit-learn

I trained an SVM using the following code:

gs_clf = GridSearchCV(svm.SVC(), parameters, cv=3, iid=False, n_jobs=-1, scoring="f1")

on data that I created like this:

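from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split

def transform_concated_data(df):
    # `comments` is not defined in this snippet; it is presumably the raw comment texts
    X_train, X_test, y_train, y_test = train_test_split(
        comments, df['yes_no'], train_size=.85, test_size=.15)

    # instantiate vect and transformers
    count_vect = CountVectorizer()
    tfidf_transformer = TfidfTransformer()

    # model train: learn the vocabulary and IDF weights from the training split
    X_train_counts = count_vect.fit_transform(X_train)
    X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

    # model test: reuse the fitted objects, transform only
    X_test_counts = count_vect.transform(X_test)
    X_test_tfidf = tfidf_transformer.transform(X_test_counts)
    return X_train_tfidf, X_test_tfidf, y_train, y_test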

I saved the model like this:

with open('/users/josh.flori/drive_backup/drive_backup/python_scripts/reddit_bot_stuff/svm_model.sav', 'wb') as f:
    cPickle.dump(gs_clf, f)

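Later, when I load it and want to classify completely new data, I currently use the following function to format the new data for the model: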

def transform_non_concated_data(df):
    # real-world test
    count_vect = CountVectorizer()
    tfidf_transformer = TfidfTransformer()
    X_NEW_counts = count_vect.fit_transform(comments)
    X_NEW_tfidf = tfidf_transformer.fit_transform(X_NEW_counts)
    y_new = df['yes_no']
    return X_NEW_tfidf, y_new

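But notice that in the second function, when I create the new TF-IDF data, I use brand-new instances of count_vect and tfidf_transformer rather than the ones fitted on the training data. So the data I feed into the model is different. Do I have to use the same count_vect and tfidf_transformer that I trained with?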

Or will it work as is, so that when I call gs_clf.predict() on the new data it somehow finds the right features?
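To make the question concrete, here is a minimal sketch on toy sentences. The documents and the names train_docs, new_docs and fresh_vect are made up for illustration. It suggests that a freshly fitted CountVectorizer learns its own vocabulary, so its columns need not line up with the ones the model was trained on, whereas calling transform() on the already fitted objects keeps the training layout.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy data, purely illustrative
train_docs = ["the cat sat", "the dog barked", "a cat and a dog"]
new_docs = ["birds sing", "the cat sings"]

# Fit on the training documents, as in transform_concated_data
count_vect = CountVectorizer()
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(count_vect.fit_transform(train_docs))

# Refitting a brand-new vectorizer on the new documents learns a different vocabulary
fresh_vect = CountVectorizer()
X_refit_counts = fresh_vect.fit_transform(new_docs)

# Reusing the fitted objects (transform only) keeps the training column layout
X_new_tfidf = tfidf_transformer.transform(count_vect.transform(new_docs))

print("training columns:", X_train_tfidf.shape[1])
print("refit columns:   ", X_refit_counts.shape[1])
print("reused columns:  ", X_new_tfidf.shape[1])
```

Only the last matrix has the same columns, in the same order, as the data the classifier was fitted on.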

Edit: to answer my own question, at training time I need to create a vocabulary like this:

training_vocab = dict((key, value) for (key, value) in
                 zip(count_vect.get_feature_names(), range(len(count_vect.get_feature_names()))))

Then, when loading new data, instantiate a new count_vect like this:

count_vect = CountVectorizer(vocabulary=training_vocab)
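Fixing the vocabulary pins the column layout, but TfidfTransformer also learns IDF weights from the training data, and those change if it is fitted again on new text. One possible way to keep everything consistent, sketched below under assumptions, is to bundle the vectorizer, the transformer and the classifier in a scikit-learn Pipeline and pickle that single object. The toy documents, labels, step names and the linear kernel are illustrative, and the sketch uses Python 3's pickle rather than cPickle.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy data, purely illustrative
train_docs = ["great post thanks", "really helpful answer",
              "this is spam", "buy cheap pills"]
train_labels = [1, 1, 0, 0]

# Hypothetical step names; the pipeline fits all three stages on the training text
text_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", SVC(kernel="linear")),
])
text_clf.fit(train_docs, train_labels)

# Persist and restore the whole pipeline (in memory here instead of a file)
payload = pickle.dumps(text_clf)
restored_clf = pickle.loads(payload)

# Raw text goes straight in; the stored vocabulary and IDF weights are reused
new_docs = ["thanks for the helpful post", "cheap pills here"]
predictions = restored_clf.predict(new_docs)
print(predictions)
```

Such a pipeline can also be passed to GridSearchCV in place of svm.SVC(), with parameter names prefixed by the step name (for example clf__C).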

0 answers:

No answers