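I followed this tutorial: https://towardsdatascience.com/creating-the-twitter-sentiment-analysis-program-in-python-with-naive-bayes-classification-672e5589a7ed to build a Twitter sentiment analyzer that uses the naive Bayes classifier from the nltk library to classify tweets as positive, negative or neutral, but the only labels it ever returns are neutral or irrelevant. I have included my code below. I don't know much about machine learning, so any help would be appreciated.
I have tried classifying different sets of tweets, and even when I specify a search keyword such as "happy" it still returns "neutral". I don't…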
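import nltk

def buildvocab(processedtrainingdata):
    # Collect every word from all training tweets
    all_words = []
    for (words, sentiment) in processedtrainingdata:
        all_words.extend(words)
    # FreqDist maps each distinct word to its count; its keys form the vocabulary
    wordlist = nltk.FreqDist(all_words)
    word_features = wordlist.keys()
    return word_features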
def extract_features(tweet):
    tweet_words = set(tweet)
    features = {}
    for word in word_features:
        # One boolean feature per vocabulary word:
        # True if the word occurs in this tweet, False otherwise
        features['contains(%s)' % word] = (word in tweet_words)
    return features
# Building the feature vector
word_features = buildvocab(processedtrainingdata)
# apply_features does the actual extraction
training_features = nltk.classify.apply_features(extract_features, processedtrainingdata)
# Train the classifier (assumed step: Nbayes is not defined anywhere in the snippet as posted)
Nbayes = nltk.NaiveBayesClassifier.train(training_features)
Nbayes_result_labels = [Nbayes.classify(extract_features(tweet[0])) for tweet in processedtestset]
# Take the majority vote between the positive and negative labels
if Nbayes_result_labels.count('positive') > Nbayes_result_labels.count('negative'):
    print('Positive')
    print(str(100*Nbayes_result_labels.count('positive')/len(Nbayes_result_labels)))
elif Nbayes_result_labels.count('negative') > Nbayes_result_labels.count('positive'):
    print(str(100*Nbayes_result_labels.count('negative')/len(Nbayes_result_labels)))
    print('Negative sentiment')
else:
    print('Neutral')
#the output is always something like this:
print(Nbayes_result_labels)
['neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'irrelevant', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral']
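Answer 0 (score: 0)
Your dataset is highly imbalanced. As you mentioned in one of your comments, you have 550 tweets labeled positive and 550 labeled negative, but 4000 labeled neutral, which is why the classifier always favors the majority class. If possible, every class should have the same number of utterances. You also need to learn about evaluation metrics; you will most likely find that your recall is poor. An ideal model should perform well on all evaluation metrics. To avoid overfitting to particular classes you can also add a fourth "other" class, but you can skip that for now.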
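To make the point about evaluation metrics concrete, here is a minimal sketch of per-class precision and recall in plain Python. The helper name `per_class_report` and the toy labels are made up for illustration; they are not taken from the question's data. The toy predictions imitate a classifier that always answers 'neutral'.

```python
from collections import Counter

def per_class_report(gold, predicted):
    """Return {label: (precision, recall)} for every label seen in gold or predicted."""
    labels = sorted(set(gold) | set(predicted))
    # Count correct predictions per label
    true_pos = Counter(g for g, p in zip(gold, predicted) if g == p)
    gold_counts = Counter(gold)
    pred_counts = Counter(predicted)
    report = {}
    for label in labels:
        # Guard against division by zero when a label is never predicted or never occurs
        precision = true_pos[label] / pred_counts[label] if pred_counts[label] else 0.0
        recall = true_pos[label] / gold_counts[label] if gold_counts[label] else 0.0
        report[label] = (precision, recall)
    return report

# Hypothetical toy labels (not real results): everything is predicted as 'neutral'
gold = ['positive'] * 2 + ['negative'] * 2 + ['neutral'] * 6
predicted = ['neutral'] * 10

accuracy = sum(g == p for g, p in zip(gold, predicted)) / len(gold)
report = per_class_report(gold, predicted)
print(accuracy)
for label, (precision, recall) in report.items():
    print(label, precision, recall)
```

On this toy input the accuracy looks acceptable because most examples are neutral, while the recall for positive and negative is zero. That is the pattern to look for on your own held-out test set.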
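There are a few things you can do to improve the model's performance: oversample the minority classes by adding similar utterances (that is, adding more data), undersample the majority class, or combine the two. You can read about oversampling and undersampling online.
In the new dataset, try to keep the utterances of all classes in a 1:1:1 ratio if possible. Finally, try other algorithms, tuning their hyperparameters with grid search, random search or tpot.
Edit: in your case the "other" class is "irrelevant", so you now have 4 classes. Try to get a dataset with a 1:1:1:1 ratio across the classes.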
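As a rough illustration of resampling, the sketch below balances a list of `(words, label)` pairs, which is assumed to be the shape of `processedtrainingdata` in the question. The function `rebalance` and its `target_per_class` parameter are hypothetical names. Note that it oversamples by duplicating existing examples at random, which is a simpler stand-in for collecting genuinely new, similar utterances.

```python
import random
from collections import Counter, defaultdict

def rebalance(labeled_data, target_per_class, seed=0):
    """Randomly under/oversample so every class has exactly target_per_class examples."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for words, label in labeled_data:
        by_label[label].append((words, label))
    balanced = []
    for label, examples in by_label.items():
        if len(examples) >= target_per_class:
            # Undersample: draw without replacement
            balanced.extend(rng.sample(examples, target_per_class))
        else:
            # Oversample: keep all originals and add random duplicates
            missing = target_per_class - len(examples)
            balanced.extend(examples + [rng.choice(examples) for _ in range(missing)])
    rng.shuffle(balanced)
    return balanced

# Placeholder data with the class sizes mentioned above (550 / 550 / 4000)
data = ([(['pos%d' % i], 'positive') for i in range(550)]
        + [(['neg%d' % i], 'negative') for i in range(550)]
        + [(['neu%d' % i], 'neutral') for i in range(4000)])

balanced = rebalance(data, target_per_class=1000)
counts = Counter(label for _, label in balanced)
print(counts)
```

Resample only the training split. If the test set contains duplicated examples, the evaluation metrics will no longer reflect performance on unseen tweets.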