Question

I am using a random forest to predict 500 labels, with 300K samples for training/testing. I save the model with joblib like this:

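import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Build bag-of-words counts, then reweight them with TF-IDF
count_vect = CountVectorizer()
x_count = count_vect.fit_transform(df['text'].astype(str))
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(x_count)

# Hold out 25% of the TF-IDF matrix for testing and fit the forest on the rest
x_train, x_test, y_train, y_test = train_test_split(X_train_tfidf, df.label, test_size=0.25, random_state=0)
classifier = RandomForestClassifier(n_estimators=100, random_state=0, verbose=2, n_jobs=-1)
classifier.fit(x_train, y_train)

# Wrap the TF-IDF transformer and the classifier in a pipeline and save it
vec_clf = Pipeline([('vectorizer', tfidf_transformer), ('pac', classifier)])
classifier.fit(x_train, y_train)
joblib.dump(vec_clf, 'class.pkl', compress=9)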

When I load it like this to test on new data:

vectorizer_classifier = joblib.load('class.pkl')
y_pred=vectorizer_classifier.predict(x_test_new)

I get random results (a score of 10%).

P.S. My code is based on this article:

Bringing a classifier to production

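Should I preprocess the raw data in some special way, or is the problem in how I save the vocabulary? Should I save the count vectorizer or the TF-IDF transformer?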
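
One possible reading of the code above: the pipeline written to class.pkl contains only the TF-IDF transformer and the forest, so the fitted CountVectorizer vocabulary is never saved. If a new CountVectorizer is fitted on the new data, its column indices will most likely not match the ones the forest was trained on, which could explain near-random predictions. The following is a minimal sketch, not a confirmed fix, that keeps all three steps in one pipeline and feeds it raw strings. The toy texts and labels, the step names, and the temporary file path are assumptions made for illustration, and it assumes x_test_new would be a list of raw strings.

```python
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Toy stand-in for df['text'] and df.label (hypothetical data, not from the question).
texts = ["cheap pills online", "meeting at noon",
         "buy cheap pills", "lunch meeting today"] * 5
labels = ["spam", "ham", "spam", "ham"] * 5

# All three steps live in one pipeline, so the fitted vocabulary, the IDF
# weights and the forest are saved and restored together.
text_clf = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
])
text_clf.fit(texts, labels)  # fit on raw strings, not on a precomputed matrix

# Save the whole fitted pipeline (temporary directory used here for illustration).
model_path = os.path.join(tempfile.mkdtemp(), "class.pkl")
joblib.dump(text_clf, model_path, compress=9)

# Later, in the serving process: pass raw text straight to the loaded pipeline.
loaded_clf = joblib.load(model_path)
new_texts = ["cheap pills", "meeting today"]
original_pred = text_clf.predict(new_texts)
loaded_pred = loaded_clf.predict(new_texts)
```

With this layout there is no separate vectorizer to save or refit: new data goes through the same vocabulary and IDF weights that were learned at training time. For a train/test split, the raw text would be split first and the pipeline fitted on the training part only.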

Thanks in advance.

Loading a vectorizer + random forest model gives random results

0 answers: