I am using a random forest to predict 500 labels, with 300K samples for training and testing. I save the model with joblib like this:
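from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
import joblib

# Bag-of-words counts, then TF-IDF weighting, both fitted on the full text column
count_vect = CountVectorizer()
x_count = count_vect.fit_transform(df['text'].astype(str))
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(x_count)
x_train, x_test, y_train, y_test = train_test_split(X_train_tfidf, df.label, test_size=0.25, random_state=0)
classifier = RandomForestClassifier(n_estimators=100, random_state=0, verbose=2, n_jobs=-1)
classifier.fit(x_train, y_train)
# The saved pipeline holds only the TF-IDF transformer and the classifier (count_vect is not part of it)
vec_clf = Pipeline([('vectorizer', tfidf_transformer), ('pac', classifier)])
classifier.fit(x_train, y_train)  # second, redundant fit on the same data
joblib.dump(vec_clf, 'class.pkl', compress=9)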
When I load it like this to test on new data:
vectorizer_classifier = joblib.load('class.pkl')
y_pred = vectorizer_classifier.predict(x_test_new)
I get what look like random results (a score of 10%).
P.S. My code is based on this article:
Bringing a classifier to production
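Should I process the raw data in some special way, or is the problem in how I save the vocabulary? Should I save the count vectorizer or the TF-IDF transformer?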
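To make the vocabulary question concrete, here is a minimal sketch of one arrangement I am considering: putting the CountVectorizer inside the pipeline as well, so that the fitted vocabulary, the IDF weights and the forest are all written to the same file. The toy texts and labels, the step names and the temporary file path are placeholders of mine, not my real data, and I am not sure this is the recommended approach.

```python
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Toy stand-in data (hypothetical; replaces df['text'] and df.label for illustration).
texts = ["cheap flights to paris", "book a hotel in rome",
         "python list comprehension", "java null pointer exception"] * 5
labels = ["travel", "travel", "code", "code"] * 5

# All three steps live in one pipeline, so the fitted vocabulary,
# the IDF weights and the forest are pickled together.
text_clf = Pipeline([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("forest", RandomForestClassifier(n_estimators=20, random_state=0)),
])
text_clf.fit(texts, labels)  # fitted on raw strings, not on a precomputed matrix

# Round-trip through joblib, as in the snippet above.
model_path = os.path.join(tempfile.mkdtemp(), "class.pkl")
joblib.dump(text_clf, model_path, compress=9)
loaded = joblib.load(model_path)

# New data is passed as raw text; the loaded pipeline reuses the training vocabulary.
new_texts = ["hotel in paris", "python exception"]
predictions = loaded.predict(new_texts)
```

With this layout, new documents would go to predict as plain strings and no separate vectorizer would need to be refitted or saved on the side.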
Thanks in advance.