naiveBayes gives unexpected results when passed a single document instead of many

Asked: 2014-06-16 17:47:51

Tags: r

I have the following code, which works very well:

news_classifier <- naiveBayes(news_train, news_raw_train$type)
news_test_pred <- predict(news_classifier, news_test)

In this context, news_test is defined as follows:

news_test <- DocumentTermMatrix(news_corpus_test, control = list(dictionary = news_dict))
news_test <- apply(news_test, MARGIN = 2, convert_counts)

news_corpus_test contains a few hundred documents.

My problem is that I am now trying to pass just a single document to predict() and have it tell me which class it belongs to. I proceed as follows:

# first, I pass a vector containing words which I know will be found
# in my news_dict
corpus <- Corpus(VectorSource(c('life dog women')))

corpus_clean1 <- tm_map(corpus, removeNumbers)
corpus_clean1 <- tm_map(corpus_clean1, removeWords, stopwords('english'))
corpus_clean1 <- tm_map(corpus_clean1, removePunctuation)
corpus_clean1 <- tm_map(corpus_clean1, stripWhitespace)

inspect(corpus_clean1)

Output of inspecting the corpus:

> inspect(corpus_clean1)
<<VCorpus (documents: 1, metadata (corpus/indexed): 0/0)>>

[[1]]
<<PlainTextDocument (metadata: 7)>>
life dog women

Then I create the DTM like this:

test <- DocumentTermMatrix(corpus_clean1, control = list(dictionary = news_dict))
test <- apply(test, MARGIN = 2, convert_counts)

test

Which outputs:

> test
      after        back        beat        bill      brazil      bridge         can     cantors         car        cave 
         No          No          No          No          No          No          No          No          No          No 
    circles     climate         cup        cuts         dog       egypt       fairy forecasters        game     germany 
         No          No          No          No         Yes          No          No          No          No          No 
        get       group        hits         how        iraq       iraqi     israeli        kill       kills         kim 
         No          No          No          No          No          No          No          No          No          No 
       life        loss        make         man   militants    namibias       obama        open    pakistan      photos 
        Yes          No          No          No          No          No         Yes          No          No          No 
      plane      police   president        rail republicans        says      school      strike         the       trial 
         No          No          No          No          No          No          No          No          No          No 
    ukraine    violence       visit        vote       watch     weather        week       white         who        will 
         No          No          No          No          No          No          No          No          No          No 
        win       woman     workers       world     yearold 
         No          No          No          No          No 
Levels: No Yes

Finally, I run predict():

test_pred <- predict(news_classifier, test)
test_pred

Which outputs:

> test_pred
 [1] neutral neutral neutral neutral neutral neutral neutral neutral neutral neutral neutral neutral neutral neutral neutral
[16] neutral neutral neutral neutral neutral neutral neutral neutral neutral neutral neutral neutral neutral neutral neutral
[31] neutral neutral neutral neutral neutral neutral neutral neutral neutral neutral neutral neutral neutral neutral neutral
[46] neutral neutral neutral neutral neutral neutral neutral neutral neutral neutral neutral neutral neutral neutral neutral
[61] neutral neutral neutral neutral neutral
Levels: negative neutral positive

However, I was not expecting all of these values. Instead, I expected a single value to be returned: neutral, positive, or negative.

Why doesn't predict() return what I expect?

Edit: here is the convert_counts() function:

# convert > 0 to factor Yes/No
convert_counts <- function(x) {
  x <- ifelse(x > 0, 1, 0)
  x <- factor(x, levels = c(0, 1), labels = c("No", "Yes"))
  return (x)
}

Edit: here are the packages I am using:

# get text mining package
install.packages("tm")
library(tm)
# get e1071 package for naive bayes classifier
install.packages("e1071")
library(e1071)
# get gmodels
library(gmodels)

1 Answer:

Answer 0: (score: 1)

I'm not sure this is the most elegant solution, but I think this should work:

    test <- DocumentTermMatrix(corpus_clean1, control = list(dictionary = news_dict))
    test <- data.frame(as.matrix(test))
    test <- data.frame(lapply(test,convert_counts))
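For context on why this probably helps, here is a minimal sketch in base R only. It assumes, as far as I understand e1071, that predict() converts newdata with as.data.frame(), so a plain vector is read as one observation per element. The small matrices `many` and `one` are made-up stand-ins for the document-term matrices, not the asker's data.

```r
# Same helper as in the question: counts > 0 become "Yes", otherwise "No".
convert_counts <- function(x) {
  x <- ifelse(x > 0, 1, 0)
  factor(x, levels = c(0, 1), labels = c("No", "Yes"))
}

# Hypothetical stand-in for a document-term matrix: 2 documents, 3 terms.
many <- matrix(c(0, 2, 1, 0, 0, 3), nrow = 2,
               dimnames = list(c("doc1", "doc2"), c("dog", "life", "women")))
# The same matrix restricted to a single document (still 1 x 3).
one <- many[1, , drop = FALSE]

# With several rows, apply() hands back a documents-by-terms matrix.
many_out <- apply(many, MARGIN = 2, convert_counts)
dim(many_out)                 # 2 3

# With one row, each call returns a single value, so apply() simplifies
# the result to a named vector and the row dimension is lost.
one_out <- apply(one, MARGIN = 2, convert_counts)
dim(one_out)                  # NULL
nrow(as.data.frame(one_out))  # 3: one "observation" per term

# The approach from the answer keeps one row with one column per term.
fixed <- data.frame(as.matrix(one))
fixed <- data.frame(lapply(fixed, convert_counts))
dim(fixed)                    # 1 3
```

With a single row, apply() returns a named factor instead of a 1 x N matrix, which would explain the 65 predictions in the question (one per dictionary term). Building a one-row data frame keeps the terms as columns, so predict() sees a single observation.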