I'm trying to do sentiment analysis on tweets in R with a Naive Bayes classifier. The tweets have been manually labelled as positive or negative. The packages I'm using are tm, weka, RTextTools and e1071. The problem is that when I run the code, every tweet seems to get predicted as positive, as shown in the confusion matrix, so I think something is wrong with my document-term matrix or I'm missing something else. I suspect the main issue may be the extremely sparse document-term matrix, which could be causing problems for the NB classifier, because this is what I get:
<<DocumentTermMatrix (documents: 2289, terms: 8565)>>
Non-/sparse entries: 20052/19585233
Sparsity : 100%
Maximal term length: 73
Weighting : term frequency (tf)
I'm also trying to figure out how to reduce the number of sparse entries and bring the sparsity down from 100% to somewhere around 50-70%.
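For reference, this is the kind of thing I was considering to shrink the matrix: tm's removeSparseTerms() applied to the tweetsTDM object built in the code below. The 0.99 threshold is just an illustrative value I picked, not something from the tutorial.

library(tm)

# Drop every term that is absent from more than 99% of the documents;
# a lower threshold (e.g. 0.95) removes more terms and reduces sparsity further.
tweetsTDM_small <- removeSparseTerms(tweetsTDM, 0.99)
tweetsTDM_small   # print the new document/term counts and sparsity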
I based my code on the example at http://chengjun.github.io/en/2014/04/sentiment-analysis-with-machine-learning-in-R/ and tried to adapt the solution shown there to my problem.
file <- read.csv("twitter4242_2_1a.csv") # the file has two columns: the first states
# positive or negative, and the second holds the tweet text itself
tweetsCorpus <- Corpus(VectorSource(file[,2])) # selecting the tweets from the 2nd column
tweetsTDM <- DocumentTermMatrix(tweetsCorpus,
                                control = list(asPlain = TRUE,
                                               stopwords = TRUE,
                                               tolower = TRUE,
                                               removeNumbers = TRUE,
                                               stemWords = FALSE,
                                               removePunctuation = TRUE,
                                               stripWhitespace = TRUE))
# tokenize = NGramTokenizer))   # I'm not sure whether the tokenizer should be included or not, but I get the same result either way.
mat1 <- as.matrix(tweetsTDM) # creating matrix from the tweetsTDM
classifier <- naiveBayes(mat1[1:1500,], as.factor(file[1:1500,1])) # training the NB classifier with the first 1500 rows, with the factor from the first column (positive/negative)
predicted <- predict(classifier, file[1501:2288,]); #predicting the remaining rows in file, based on the classifier model
table (file[1501:2288,1], predicted) # confusion matrix
An example of the results I get:
predicted
negative positive
negative 0 324
positive 0 464
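One sanity check I was considering (not shown above) is simply looking at the label distribution in the two halves of the split, in case a class imbalance explains the all-positive predictions; table() on the label column should be enough for that:

table(file[1:1500, 1])      # label counts in the rows used for training
table(file[1501:2288, 1])   # label counts in the rows used for prediction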