I want to build sentiment analysis in Python. I have finished the model-building part, but I am facing a challenge with a new dataset. I have saved my model in pickle format. My original dataset looks like this (about 22 million rows):
text                                     category
product is good                          low
product is horrible                      high
it's not working properly                high
quality wise good but still not happy    low
My new dataset is:
text
i am a happy customer with this product
product quality is poor
sound is not good
overall its a good product
My desired final output looks like:
text                                       category
i am a happy customer with this product    low
product quality is poor                    high
sound is not good                          high
overall its a good product                 low
My code, after basic cleaning, looks like this:
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

nb = Pipeline([('vect', CountVectorizer()),
               ('tfidf', TfidfTransformer()),
               ('clf', MultinomialNB()),
               ])
nb.fit(X_train, y_train)
%%time
from sklearn.metrics import accuracy_score, classification_report
y_pred = nb.predict(X_test)
print('accuracy %s' % accuracy_score(y_pred, y_test))
print(classification_report(y_test, y_pred, target_names=my_tags))
import pickle
print("Model trained. Saving model to text_classification_NB1.pickle")
with open("text_classification_NB1.pickle", "wb") as file:
    pickle.dump(nb, file)
print("Model saved.")
Now for the prediction on the new data:
import pandas as pd
data = pd.read_csv("/Users/email/email_test_22122020.csv", encoding='latin1')
nb.predict(data)
But it gives me this result:
array(['low'], dtype='<U4')
Any help?
Answer 0 (score: 0)
You should apply stemming and remove stop words from both the training and test datasets to get better results. For details, see Removing stop words with NLTK in Python and Python | Stemming words with NLTK.
Answer 1 (score: 0)
You should try data-cleaning techniques on the text, such as lemmatization and stemming, to make sure the analysis runs on properly normalized text. Also remove stop words to get rid of unnecessary tokens.
You can also try TextBlob's Sentiment Analysis to find out the polarity of a sentence.