I have trained on roughly 80 MB of IMDB movie-review data and am running sentiment analysis on movie-related test data. I have tried Naive Bayes and SVM (scikit-learn), but I still cannot improve the accuracy of the results.
I tried these approaches: https://colab.research.google.com/drive/1yIMLlYhhl6QOj1wpjW-Zka0cLOYk7Jg7
but I cannot get more than 75% accuracy.
import itertools
import re

import emoji
from bs4 import BeautifulSoup

# Strip HTML tags and unescape HTML entities
tweet = BeautifulSoup(tweet, "html.parser").get_text()
# Special case not handled previously
tweet = tweet.replace('\x92', "'")
# Remove hashtags and @mentions
tweet = ' '.join(re.sub(r"(@[A-Za-z0-9]+)|(#[A-Za-z0-9]+)", " ", tweet).split())
# Remove URLs
tweet = ' '.join(re.sub(r"(\w+:\/\/\S+)", " ", tweet).split())
# Remove punctuation
tweet = ' '.join(re.sub(r"[\.\,\!\?\:\;\-\=]", " ", tweet).split())
# Lowercase
tweet = tweet.lower()
# Expand contractions; source: https://en.wikipedia.org/wiki/Contraction_%28grammar%29
CONTRACTIONS = load_dict_contractions()
tweet = tweet.replace("’", "'")
words = tweet.split()
reformed = [CONTRACTIONS[word] if word in CONTRACTIONS else word for word in words]
tweet = " ".join(reformed)
# Standardize words: cap runs of repeated characters at two ("soooo" -> "soo")
tweet = ''.join(''.join(s)[:2] for _, s in itertools.groupby(tweet))
# Replace smileys with words
# source: https://en.wikipedia.org/wiki/List_of_emoticons
SMILEY = load_dict_smileys()
words = tweet.split()
reformed = [SMILEY[word] if word in SMILEY else word for word in words]
tweet = " ".join(reformed)
# Convert emojis to their text names
tweet = emoji.demojize(tweet)
# Strip accents
tweet = strip_accents(tweet)
tweet = tweet.replace(":", " ")
tweet = ' '.join(tweet.split())
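The "standardize words" step above caps any run of repeated characters at two, which normalizes elongated words like "soooo". A self-contained sketch of just that step:

```python
import itertools

def limit_repeats(text):
    # Group consecutive identical characters and keep at most two of each,
    # so "soooo" becomes "soo" and "!!!" becomes "!!".
    return ''.join(''.join(group)[:2] for _, group in itertools.groupby(text))

print(limit_repeats("soooo gooood!!!"))  # soo good!!
```

Note this cannot recover the dictionary form ("soo" rather than "so"), which is why it is usually combined with a spell-check or lexicon lookup afterwards.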
Current results:

              precision    recall  f1-score   support

           0       0.74      0.75      0.75       814
           1       0.77      0.76      0.77       892

   micro avg       0.76      0.76      0.76      1706
   macro avg       0.76      0.76      0.76      1706
weighted avg       0.76      0.76      0.76      1706
I expect it should be possible to reach up to 90%.
Link to the full code: https://drive.google.com/file/d/1dsnv86Rgu-NOXnLY1fbvRfN3Homb541K/view?usp=sharing
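For context, the kind of scikit-learn pipeline I mean is roughly this (a minimal sketch with toy data; the vectorizer settings and Naive Bayes variant shown are illustrative, not my exact configuration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy training data standing in for the preprocessed IMDB reviews.
train_texts = [
    "great movie, loved it",
    "terrible film, boring plot",
    "wonderful acting and story",
    "awful, a waste of time",
]
train_labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

# TF-IDF bag of words (unigrams + bigrams) feeding a Naive Bayes classifier.
clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)),
    ("nb", MultinomialNB()),
])
clf.fit(train_texts, train_labels)
print(clf.predict(["loved the wonderful story"]))  # [1]
```

Swapping `MultinomialNB` for `sklearn.svm.LinearSVC` gives the SVM variant I also tried.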