How can I estimate each word's importance to the label predicted by a naive Bayes classifier?

Asked: 2019-10-13 21:14:01

Tags: r r-caret naivebayes

I am training a naive Bayes model to predict the sentiment of movie reviews. The model is built and runs fine, but I am stuck on one question: how do I rank the words by their class-conditional probabilities in the likelihood tables, and so estimate each word's importance to the label classification?

The task is to train a naive Bayes model to predict whether a movie review is positive or negative. After training the model, I ran `nbclassifier$tables` to display the list of conditional probabilities for each feature. From there, I don't know how to pick the top 5 most distinctive features for positive reviews and the top 5 most distinctive features for negative reviews.

rm(list = ls(all.names = TRUE)) 

library(knitr)
knitr::opts_chunk$set(echo = TRUE)

library(quanteda)
library(readtext)

library(pacman)
pacman::p_load(ElemStatLearn,foreign,class,caret,e1071) 

url = "http://www.ocf.berkeley.edu/~janastas/data/movie-pang02.csv"
dataframe <- readtext(url, text_field = "text")

## change the column names for own use
colnames(dataframe) <- c("doc_id", "text","posneg_type")

## creating corpus
doc.corpus <- corpus(dataframe)
summary(doc.corpus,5)

# remove punctuation, numbers, and separators, remove all the stop words,
# stem the words, and convert everything to lowercase
doc.corpus.dfm <- tokens(doc.corpus,  
                     remove_numbers = TRUE, 
                     remove_symbols = TRUE, 
                     remove_url = TRUE,
                     remove_punct = TRUE,
                     remove_twitter = TRUE,
                     remove_separators = TRUE) %>%
tokens_remove("\\p{P}", valuetype = "regex", padding = TRUE) %>%
tokens_remove(stopwords("english"), padding  = TRUE) %>%
tokens_remove( "s", padding = TRUE)  %>%
tokens_remove( "t", padding = TRUE)  %>%
tokens_ngrams(n=1) %>%
tokens_tolower() %>%
tokens_wordstem() %>%
dfm()

doc.corpus.dfm <- dfm(doc.corpus.dfm)

doc.corpus.dfm.sparse <- dfm_trim(doc.corpus.dfm, sparsity = 0.9)

reviewsDTM_F <- data.matrix(dfm_weight(doc.corpus.dfm.sparse, scheme = "count"))

## Train a naive Bayes classifier called "nbclassifier" with a 70%/30%
## training/testing split
# convert to a dataframe
reviewsDTM_F <- as.data.frame(reviewsDTM_F, stringsAsFactors =  FALSE)
# set NA values to 0
reviewsDTM_F[is.na(reviewsDTM_F)] <- 0

sum(is.na(reviewsDTM_F))

## Creating training dataset
reviewtrunc <- data.frame(yvar = factor(docvars(doc.corpus.dfm)$posneg_type),
                          reviewsDTM_F)

set.seed(20191013)

# Resample the data (seed set before sampling so the split is reproducible)
train <- sample(1:dim(reviewtrunc)[1]) # random index shuffling
reviewtrunc.train <- reviewtrunc[train[1:1400],]
reviewtrunc.test <- reviewtrunc[train[1401:2000],]

## Train the "nbclassifier"
nbclassifier <- naiveBayes(yvar ~ ., data = reviewtrunc.train)

## Now predict the test data
test_pred <- predict(nbclassifier, reviewtrunc.test[ , names(reviewtrunc.test) != "yvar"])

## Retrieve the probabilities of each term from nbclassifier
nbclassifier$tables

So, how can I use the `tables` data from the `naiveBayes` function to work with these conditional probabilities and produce a list of the top 5 terms that predict a positive review and a list of the top 5 terms that predict a negative review?
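One hedged sketch of how the `$tables` output could be ranked: with `e1071::naiveBayes` fit on numeric count features, each element of `nbclassifier$tables` is a per-class matrix whose first column holds the class-conditional mean for that term. The smoothing constant `eps` below is a hypothetical choice to avoid division by zero; the class labels come from the row names of the tables themselves, so nothing here assumes a particular level order.

```r
## Sketch: rank features by the log-ratio of their class-conditional means.
## Assumes numeric predictors, so each element of nbclassifier$tables is a
## matrix with one row per class and column 1 holding the per-class mean.
tabs <- nbclassifier$tables
cls  <- rownames(tabs[[1]])          # the factor levels, e.g. "Neg", "Pos"

eps <- 1e-6                          # hypothetical smoothing constant
mean_by_class <- sapply(tabs, function(m) m[, 1])  # classes x terms matrix

## log-ratio > 0 favours the second level, < 0 favours the first
log_ratio <- log((mean_by_class[2, ] + eps) / (mean_by_class[1, ] + eps))

head(sort(log_ratio, decreasing = TRUE), 5)   # top 5 terms for level cls[2]
head(sort(log_ratio, decreasing = FALSE), 5)  # top 5 terms for level cls[1]
```

The result of each `head(sort(...))` call is a named numeric vector whose names are the terms, so the names alone give the top-5 lists.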

0 Answers:

There are no answers yet.