I have the following code, which works very well:
news_classifier <- naiveBayes(news_train, news_raw_train$type)
news_test_pred <- predict(news_classifier, news_test)
In this context, news_test is defined as follows:
news_test <- DocumentTermMatrix(news_corpus_test, control = list(dictionary = news_dict))
news_test <- apply(news_test, MARGIN = 2, convert_counts)
news_corpus_test contains a few hundred documents.
My problem is that I am now trying to pass just a single document to predict() and have it tell me which class it belongs to. I proceed as follows:
# first, I pass a vector containing words which I know will be found
# in my news_dict
corpus <- Corpus(VectorSource(c('life dog women')))
corpus_clean1 <- tm_map(corpus, removeNumbers)
corpus_clean1 <- tm_map(corpus_clean1, removeWords, stopwords('english'))
corpus_clean1 <- tm_map(corpus_clean1, removePunctuation)
corpus_clean1 <- tm_map(corpus_clean1, stripWhitespace)
inspect(corpus_clean1)
Inspecting the corpus outputs:
> inspect(corpus_clean1)
<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>
[[1]]
<<PlainTextDocument (metadata: 7)>>
life dog women
Then I create the DTM like this:
test <- DocumentTermMatrix(corpus_clean1, control = list(dictionary = news_dict))
test <- apply(test, MARGIN = 2, convert_counts)
test
which outputs:
> test
after back beat bill brazil bridge can cantors car cave
No No No No No No No No No No
circles climate cup cuts dog egypt fairy forecasters game germany
No No No No Yes No No No No No
get group hits how iraq iraqi israeli kill kills kim
No No No No No No No No No No
life loss make man militants namibias obama open pakistan photos
Yes No No No No No Yes No No No
plane police president rail republicans says school strike the trial
No No No No No No No No No No
ukraine violence visit vote watch weather week white who will
No No No No No No No No No No
win woman workers world yearold
No No No No No
Levels: No Yes
Finally, I run predict():
test_pred <- predict(news_classifier, test)
test_pred
which outputs:
> test_pred
[1] neutral neutral neutral neutral neutral neutral neutral neutral neutral neutral neutral neutral neutral neutral neutral
[16] neutral neutral neutral neutral neutral neutral neutral neutral neutral neutral neutral neutral neutral neutral neutral
[31] neutral neutral neutral neutral neutral neutral neutral neutral neutral neutral neutral neutral neutral neutral neutral
[46] neutral neutral neutral neutral neutral neutral neutral neutral neutral neutral neutral neutral neutral neutral neutral
[61] neutral neutral neutral neutral neutral
Levels: negative neutral positive
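However, this is not what I expected. Instead, I expected a single value to be returned: neutral, positive, or negative.
Why does predict() not return what I expect?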
Edit: here is the convert_counts() function:
# convert > 0 to factor Yes/No
convert_counts <- function(x) {
x <- ifelse(x > 0, 1, 0)
x <- factor(x, levels = c(0, 1), labels = c("No", "Yes"))
return (x)
}
Edit: here are the packages I am using:
# get text mining package
install.packages("tm")
library(tm)
# get e1071 package for naive bayes classifier
install.packages("e1071")
library(e1071)
# get gmodels
library(gmodels)
Answer 0 (score: 1)
I'm not sure this is the most elegant solution, but I think this should work:
test <- DocumentTermMatrix(corpus_clean1, control = list(dictionary = news_dict))
test <- data.frame(as.matrix(test))
test <- data.frame(lapply(test,convert_counts))
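A likely reason the original attempt returned 65 values: when the matrix has only one row, apply(..., MARGIN = 2, ...) gets a length-1 result per column and simplifies the whole thing to a plain vector with one element per term. predict() then appears to treat each element as a separate observation rather than as one document with 65 features. The sketch below reproduces the shape problem in base R only; the matrix m is a made-up stand-in for as.matrix(DocumentTermMatrix(...)), and its three terms are chosen just for illustration.

```r
# convert_counts as defined in the question
convert_counts <- function(x) {
  x <- ifelse(x > 0, 1, 0)
  factor(x, levels = c(0, 1), labels = c("No", "Yes"))
}

# Hypothetical term counts for ONE document (assumption: stands in for the real DTM)
m <- matrix(c(1, 0, 1), nrow = 1,
            dimnames = list("1", c("dog", "game", "life")))

# With a single row, apply() collapses to a vector: one element per term
collapsed <- apply(m, MARGIN = 2, convert_counts)
is.null(dim(collapsed))          # no dimensions left
length(collapsed)                # one entry per term
nrow(as.data.frame(collapsed))   # would be read as one observation per term

# Going through a data frame keeps one row (the document) and one column per term
fixed <- data.frame(lapply(data.frame(m), convert_counts))
dim(fixed)                       # 1 row, 3 columns
```

With the real data, the one-row data frame built in the answer can then be passed to predict(news_classifier, test), which should return a single class for the single document.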