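I am training a naive Bayes model to predict the sentiment of movie reviews. The model is built and works well, but I am stuck on one problem: how to rank words by their conditional probabilities (likelihoods), and from that estimate how important each word is for classifying each label.
The task is to train a naive Bayes model that predicts whether a movie review is positive or negative. After training the model, I call `nbclassifier$tables` to list the conditional probabilities for each feature. What I can't work out is how to pick the 5 most distinctive features for positive reviews and the 5 most distinctive features for negative reviews.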
rm(list = ls(all.names = TRUE))
library(knitr)
knitr::opts_chunk$set(echo = TRUE)
library(quanteda)
library(readtext)
library(pacman)
pacman::p_load(ElemStatLearn,foreign,class,caret,e1071)
url = "http://www.ocf.berkeley.edu/~janastas/data/movie-pang02.csv"
dataframe <- readtext(url, text_field = "text")
## change the column names for own use
colnames(dataframe) <- c("doc_id", "text","posneg_type")
## creating corpus
doc.corpus <- corpus(dataframe)
summary(doc.corpus,5)
# remove punctuation, remove numbers, remove all the spaces, stem the words,
# remove all of the stop words, and convert everything into lowercase.
doc.corpus.dfm <- tokens(doc.corpus,
                         remove_numbers = TRUE,
                         remove_symbols = TRUE,
                         remove_url = TRUE,
                         remove_punct = TRUE,
                         remove_twitter = TRUE,
                         remove_separators = TRUE) %>%
  tokens_remove("\\p{P}", valuetype = "regex", padding = TRUE) %>%
  tokens_remove(stopwords("english"), padding = TRUE) %>%
  tokens_remove("s", padding = TRUE) %>%
  tokens_remove("t", padding = TRUE) %>%
  tokens_ngrams(n = 1) %>%
  tokens_tolower() %>%
  tokens_wordstem() %>%
  dfm()
doc.corpus.dfm <- dfm(doc.corpus.dfm)
doc.corpus.dfm.sparse <- dfm_trim(doc.corpus.dfm, sparsity = 0.9)
reviewsDTM_F <- data.matrix(dfm_weight(doc.corpus.dfm.sparse, scheme = "count"))
## Train a naive Bayes classifier called "nbclassifier" with a 70%/30%
## training/testing split
# convert to a dataframe
reviewsDTM_F <- as.data.frame(reviewsDTM_F, stringsAsFactors = FALSE)
# set NA values to 0
reviewsDTM_F[is.na(reviewsDTM_F)] <- 0
sum(is.na(reviewsDTM_F))
## Creating training dataset
# Use the docvars() accessor rather than reaching into the @docvars slot directly
reviewtrunc <- data.frame(yvar = factor(docvars(doc.corpus.dfm, "posneg_type")),
                          reviewsDTM_F)
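# Set the seed before sampling so that the split is reproducible
set.seed(20191013)
# Shuffle the row indices, then use 1400 rows (70%) for training and 600 (30%) for testing
train <- sample(seq_len(nrow(reviewtrunc)))
reviewtrunc.train <- reviewtrunc[train[1:1400], ]
reviewtrunc.test <- reviewtrunc[train[1401:2000], ]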
## Train the "nbclassifier"
nbclassifier <- naiveBayes(yvar ~ ., data = reviewtrunc.train)
## Now predict the test data
test_pred <- predict(nbclassifier, reviewtrunc.test[ ,
names(reviewtrunc.test) != "yvar"])
## Retrieve the probabilities of each term from nbclassifier
nbclassifier$tables
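So how can I use the `tables` component returned by the `naiveBayes` function, which holds these conditional probabilities, to produce a list of the top 5 terms that predict a positive review and a list of the top 5 terms that predict a negative review?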
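To make the question concrete, here is a minimal sketch of one way the ranking might be done. It is not a definitive answer. As far as I can tell, when the predictors are numeric (as with term counts) `e1071::naiveBayes` stores a per-class mean and standard deviation for each feature in `$tables` rather than word probabilities, so the sketch ranks terms by the log ratio of the class means. The toy data, the helper `top_terms()`, the smoothing constant `eps`, and the class labels `"Pos"`/`"Neg"` are all assumptions made for illustration.

```r
library(e1071)

# Toy data standing in for reviewtrunc.train (assumption): a factor label
# plus numeric term counts. "bad" is frequent in negative reviews, "great"
# in positive ones, and "film" is equally common in both.
set.seed(1)
n <- 200
yvar <- factor(rep(c("Neg", "Pos"), each = n / 2))
toy <- data.frame(
  yvar  = yvar,
  bad   = rpois(n, ifelse(yvar == "Neg", 3, 0.5)),
  great = rpois(n, ifelse(yvar == "Pos", 3, 0.5)),
  film  = rpois(n, 2)
)
nb <- naiveBayes(yvar ~ ., data = toy)

# Hypothetical helper: rank terms by log(mean count in pos / mean count in neg).
top_terms <- function(model, pos = "Pos", neg = "Neg", n = 5, eps = 1e-6) {
  # For numeric predictors each table is a matrix with one row per class;
  # column 1 holds the class mean and column 2 the class standard deviation.
  class_means <- sapply(model$tables, function(tab) tab[, 1])
  # eps is an arbitrary smoothing constant to avoid log(0) and division by zero
  score <- log((class_means[pos, ] + eps) / (class_means[neg, ] + eps))
  list(
    positive = head(sort(score, decreasing = TRUE), n),
    negative = head(sort(score, decreasing = FALSE), n)
  )
}

top <- top_terms(nb, n = 2)
print(top$positive)  # terms most associated with "Pos"
print(top$negative)  # terms most associated with "Neg"
```

On the real model the call would presumably be `top_terms(nbclassifier, n = 5)`, with `pos` and `neg` set to whatever `levels(reviewtrunc.train$yvar)` actually reports. A log ratio of means is only one possible score; a difference in means scaled by the class standard deviations would be another option.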