I'm trying to do sentiment analysis on tweets in R with a Naive Bayes classifier. The tweets have been manually labelled as positive or negative. The packages I'm using are tm, weka, RTextTools and e1071. The problem is that when I run the code, every tweet seems to get predicted as positive, as shown in the confusion matrix, so I think something is wrong with my document-term matrix or I'm missing something else. I suspect the main issue may be the extremely sparse document-term matrix, which could be causing problems for the NB classifier, because this is what I get:
<<DocumentTermMatrix (documents: 2289, terms: 8565)>>
Non-/sparse entries: 20052/19585233
Sparsity : 100%
Maximal term length: 73
Weighting : term frequency (tf)
I'm also trying to figure out how to reduce the number of sparse entries and bring the sparsity down from 100% to somewhere around 50-70%.
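For reference, this is the kind of thing I was considering to shrink the matrix: tm's removeSparseTerms() applied to the tweetsTDM object built in the code below. The 0.99 threshold is just an illustrative value I picked, not something from the tutorial.

library(tm)

# Drop every term that is absent from more than 99% of the documents;
# a lower threshold (e.g. 0.95) removes more terms and reduces sparsity further.
tweetsTDM_small <- removeSparseTerms(tweetsTDM, 0.99)
tweetsTDM_small   # print the new document/term counts and sparsity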
I based my code on the example at http://chengjun.github.io/en/2014/04/sentiment-analysis-with-machine-learning-in-R/ and tried to adapt the solution shown there to my problem.
file <- read.csv("twitter4242_2_1a.csv") # the file has two columns: the first states
# positive or negative, and the second holds the tweet text itself
tweetsCorpus <- Corpus(VectorSource(file[,2])) # selecting the tweets from the 2nd column
tweetsTDM <- DocumentTermMatrix(tweetsCorpus,
                                control = list(asPlain = TRUE,
                                               stopwords = TRUE,
                                               tolower = TRUE,
                                               removeNumbers = TRUE,
                                               stemWords = FALSE,
                                               removePunctuation = TRUE,
                                               stripWhitespace = TRUE))
# tokenize = NGramTokenizer))   # I'm not sure whether the tokenizer should be included or not, but I get the same result either way.
mat1 <- as.matrix(tweetsTDM) # creating matrix from the tweetsTDM
classifier <- naiveBayes(mat1[1:1500,], as.factor(file[1:1500,1])) # training the NB classifier with the first 1500 rows, with the factor from the first column (positive/negative)
predicted <- predict(classifier, file[1501:2288,]); #predicting the remaining rows in file, based on the classifier model
table (file[1501:2288,1], predicted) # confusion matrix
An example of the results I get:
predicted
negative positive
negative 0 324
positive 0 464
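One sanity check I was considering (not shown above) is simply looking at the label distribution in the two halves of the split, in case a class imbalance explains the all-positive predictions; table() on the label column should be enough for that:

table(file[1:1500, 1])      # label counts in the rows used for training
table(file[1501:2288, 1])   # label counts in the rows used for prediction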