I want to build sentiment analysis in Python. I have finished the model-building part, but I am facing a challenge with a new dataset. I have saved my model in pickle format. My original dataset looks like this (about 22 million rows):
text                                     category
product is good                          low
product is horrible                      high
it's not working properly                high
quality wise good but still not happy    low
My new dataset is:
text
i am a happy customer with this product
product quality is poor
sound is not good
overall its a good product
My desired final output looks like:
text                                       category
i am a happy customer with this product    low
product quality is poor                    high
sound is not good                          high
overall its a good product                 low
My code, after basic cleaning, looks like this:
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

nb = Pipeline([('vect', CountVectorizer()),
               ('tfidf', TfidfTransformer()),
               ('clf', MultinomialNB()),
               ])
nb.fit(X_train, y_train)
%%time
from sklearn.metrics import accuracy_score, classification_report
y_pred = nb.predict(X_test)
print('accuracy %s' % accuracy_score(y_pred, y_test))
print(classification_report(y_test, y_pred, target_names=my_tags))
import pickle
print("Model trained. Saving model to text_classification_NB1.pickle")
with open("text_classification_NB1.pickle", "wb") as file:
    pickle.dump(nb, file)
print("Model saved.")
Now for the prediction on the new data:
import pandas as pd
data = pd.read_csv("/Users/email/email_test_22122020.csv", encoding='latin1')
nb.predict(data)
But it gives me this result:
array(['low'], dtype='<U4')
Any help?
Answer 0 (score: 0)
You should apply stemming and remove stop words from both the training and test datasets to get better results. For details, see Removing stop words with NLTK in Python and Python | Stemming words with NLTK.
Answer 1 (score: 0)
You should try data-cleaning techniques on the text, such as lemmatization and stemming, to make sure the analysis runs on properly normalized text. Also remove stop words to get rid of unnecessary tokens.
You can also try TextBlob's Sentiment Analysis to find out the polarity of a sentence.