I need help classifying unseen data. I have a set of sample data:
ID Comment Category
2017_01 inadequate stock Availability
2017_02 Too many failures Quality
2017_03 no documentation Customer Service
2017_04 good product Satisfied
2017_05 long delivery times Delivery
I trained a multi-class text classifier on this data. I tested the fit with both MultinomialNB and SVM, and chose SVM as the final model.
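# Support Vector Machines - calculating the SVM fit
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
# n_iter is gone in newer scikit-learn releases; max_iter=5 with tol=None runs the same fixed 5 epochs
text_clf_svm = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf-svm', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3,
                              max_iter=5, tol=None, random_state=42)),
])
text_clf_svm = text_clf_svm.fit(X_train, y_train)
# Accuracy measured on the training data itself
predicted_svm = text_clf_svm.predict(X_train)
np.mean(predicted_svm == y_train)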
0.8850102669404517
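Note that the 0.885 above is computed on X_train, the same data the model was fit on, so it says little about how the model will do on the 2018 comments. A minimal sketch of a held-out check follows, assuming a small made-up set of comments in place of the real 2017 data (the `comments` and `labels` lists are hypothetical). It uses TfidfVectorizer, which combines CountVectorizer and TfidfTransformer in one step.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the real 2017 comments and categories.
comments = [
    "inadequate stock", "out of stock again", "item not in stock", "stock ran out",
    "too many failures", "unit failed twice", "frequent failures", "product failed early",
    "long delivery times", "delivery was late", "slow delivery", "delivery took weeks",
]
labels = ["Availability"] * 4 + ["Quality"] * 4 + ["Delivery"] * 4

# Hold back 25% of the rows, keeping the class proportions the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    comments, labels, test_size=0.25, stratify=labels, random_state=42
)

clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf-svm", SGDClassifier(loss="hinge", penalty="l2", alpha=1e-3,
                              max_iter=1000, tol=1e-3, random_state=42)),
])
clf.fit(X_train, y_train)

# score() returns mean accuracy; compare the training figure with the held-out one.
train_acc = clf.score(X_train, y_train)
test_acc = clf.score(X_test, y_test)
print(f"train accuracy: {train_acc:.3f}, held-out accuracy: {test_acc:.3f}")
```

With the real data, a large gap between the two numbers would suggest the model is memorising the training comments rather than generalising.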
I tested the model on one of this year's comments:
print(text_clf_svm.predict(["This is obsolete and being replaced by another product. not very robust and we have had many failures"]))
['Quality']
Question: how do I pass in the unseen data from 2018 (below) so that it is classified in the same way?
ID Comment Category
2018_01 This product is obsolete
2018_02 Tech Support takes too long
2018_03 2 out of 3 products failed
2018_04 Delivery to APAC takes too long
Answer 0 (score: 0)
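I just figured out the solution. To map the predictions back to the original data frame, I have to add them as a new column on that data frame.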
# Bring in the new data and predict
import pandas as pd
df_p = pd.read_csv("comment_to_predict.csv", sep=',', usecols=range(2), encoding='iso-8859-1')
# Map the predictions back onto the data frame (df_p in my case);
# the column name used here must match the header in the CSV
df_p['category'] = text_clf_svm.predict(df_p['comment'])
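This works because the fitted pipeline holds the vectorizer as well as the classifier, so predict() accepts raw strings and reuses the vocabulary learned during training; nothing is refit on the new data. Here is a self-contained sketch of the same idea, assuming the sample rows from the question are built in memory rather than read from a CSV (the frame names `df_2017` and `df_2018` are made up for illustration). Five training rows are far too few for meaningful predictions, so this shows only the mechanics.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Labelled sample rows from the question (hypothetical frame name df_2017).
df_2017 = pd.DataFrame({
    "ID": ["2017_01", "2017_02", "2017_03", "2017_04", "2017_05"],
    "Comment": ["inadequate stock", "Too many failures", "no documentation",
                "good product", "long delivery times"],
    "Category": ["Availability", "Quality", "Customer Service",
                 "Satisfied", "Delivery"],
})

# Unlabelled 2018 rows from the question (hypothetical frame name df_2018).
df_2018 = pd.DataFrame({
    "ID": ["2018_01", "2018_02", "2018_03", "2018_04"],
    "Comment": ["This product is obsolete", "Tech Support takes too long",
                "2 out of 3 products failed", "Delivery to APAC takes too long"],
})

text_clf_svm = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf-svm", SGDClassifier(loss="hinge", penalty="l2", alpha=1e-3,
                              max_iter=1000, tol=1e-3, random_state=42)),
])
text_clf_svm.fit(df_2017["Comment"], df_2017["Category"])

# predict() returns one label per input row, in order, so the result
# can be assigned directly as a new column on the unseen data frame.
df_2018["Category"] = text_clf_svm.predict(df_2018["Comment"])
print(df_2018)
```

Words in the 2018 comments that never appeared in the training comments are simply ignored by the vectorizer, which is one reason a larger training set matters.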