Quanteda package, Naive Bayes: How can I predict on test data with different features?

Asked: 2017-05-23 13:51:20

Tags: r naivebayes text-analysis quanteda

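I used quanteda::textmodel_NB to create a model that classifies text into one of two categories. I fit the model on a training data set from last summer.

Now I am trying to use it this summer to classify new text we receive at work. When I try, I get the following error: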

Error in predict.textmodel_NB_fitted(model, test_dfm) : 
feature set in newdata different from that in training set

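The code in the function that generates this error is at lines 157 to 165 of its source.

I assume this happens because the words in the training data set do not exactly match the words in the test data set. But why does this raise an error? To be useful in real-world settings, the model should be able to handle data sets that contain different features, since in applied use this will probably always happen.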

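So my first question is:

1. Is this error a property of the Naive Bayes algorithm, or is it a choice made by the author of the function?

Which leads me to my second question:

2. How can I work around this?

To get at the second question, here is reproducible code (the last line generates the error above):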

library(quanteda)
library(magrittr)
library(data.table)

train_text <- c("Can random effects apply only to categorical variables?",
                "ANOVA expectation identity",
                "Statistical test for significance in ranking positions",
                "Is Fisher Sharp Null Hypothesis testable?",
                "List major reasons for different results from survival analysis among different studies",
                "How do the tenses and aspects in English correspond temporally to one another?",
                "Is there a correct gender-neutral singular pronoun (“his” vs. “her” vs. “their”)?",
                "Are collective nouns always plural, or are certain ones singular?",
                "What’s the rule for using “who” and “whom” correctly?",
                "When is a gerund supposed to be preceded by a possessive adjective/determiner?")

train_class <- factor(c(rep(0,5), rep(1,5)))

train_dfm <- train_text %>% 
  dfm(tolower=TRUE, stem=TRUE, remove=stopwords("english"))

model <- textmodel_NB(train_dfm, train_class)

test_text <- c("Weighted Linear Regression with Proportional Standard Deviations in R",
               "What do significance tests for adjusted means tell us?",
               "How should I punctuate around quotes?",
               "Should I put a comma before the last item in a list?")

test_dfm <- test_text %>% 
  dfm(tolower=TRUE, stem=TRUE, remove=stopwords("english"))

predict(model, test_dfm)

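The only thing I could think of was to manually make the features the same (I assumed this would fill in 0 for features that are absent from an object), but this produced a new error. The code for the example above is: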

model_features <- model$data$x@Dimnames$features # gets the features of the training data

test_features <- test_dfm@Dimnames$features # gets the features of the test data

all_features <- c(model_features, test_features) %>% # combining the two sets of features...
  subset(!duplicated(.)) # ...and getting rid of duplicate features

model$data$x@Dimnames$features <- test_dfm@Dimnames$features <- all_features # replacing features of model and test_dfm with all_features

predict(model, dfm) # new error?

However, this code generates a new error:

Error in if (ncol(object$PcGw) != ncol(newdata)) stop("feature set in newdata different from that in training set") : 
  argument is of length zero

How can I apply this Naive Bayes model to a new data set with different features?

2 Answers:

Answer 0: (score: 4)

Fortunately, there is an easy way to do this: you can use dfm_select() on the test data to give it the same features (and the same feature ordering) as the training set. It's this simple:

test_dfm <- dfm_select(test_dfm, train_dfm)
predict(model, test_dfm)
## Predicted textmodel of type: Naive Bayes
## 
##             lp(0)       lp(1)  Pr(0)  Pr(1) Predicted
## text1  -0.6931472  -0.6931472 0.5000 0.5000         0
## text2 -11.8698712 -13.1879095 0.7889 0.2111         0
## text3  -4.1484118  -3.6635616 0.3811 0.6189         1
## text4  -8.0091415  -8.4257356 0.6027 0.3973         0
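To make concrete what "the same features and feature ordering" means, here is a minimal base-R sketch on an ordinary count matrix rather than a quanteda dfm. The helper `align_features` and the toy feature names are hypothetical and only illustrate the idea: features unseen in training are dropped, training features missing from the test data are padded with zeros, and columns are put in the training order. It is not how quanteda implements this internally.

```r
# Hypothetical helper (not part of quanteda): align the columns of a plain
# count matrix to a reference vector of training features.
align_features <- function(x, features) {
  # Start from an all-zero matrix with one column per training feature,
  # in training order.
  out <- matrix(0, nrow = nrow(x), ncol = length(features),
                dimnames = list(rownames(x), features))
  # Copy over the counts for features present in both sets; anything that
  # appears only in `x` is silently dropped.
  shared <- intersect(features, colnames(x))
  if (length(shared) > 0) {
    out[, shared] <- x[, shared, drop = FALSE]
  }
  out
}

# Toy data (made up for illustration).
train_features <- c("test", "signific", "rank", "pronoun")
test_counts <- matrix(c(1, 0,   # "signific"
                        1, 1,   # "comma" (never seen in training)
                        0, 2),  # "test"
                      nrow = 2,
                      dimnames = list(c("text1", "text2"),
                                      c("signific", "comma", "test")))

aligned <- align_features(test_counts, train_features)
print(aligned)
```

After alignment the test matrix has exactly the training columns, so a model fitted on the training features can score it column by column.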


Answer 1: (score: 1)

As of May 2018, there appears to be a "force = TRUE" option that will also do this for you:

predict(model, test_dfm, force = TRUE)
# text1 text2 text3 text4 
#    0     0     1     0 
# Levels: 0 1

Source: discussion between koheiw and kbenoit on the quanteda GitHub: https://github.com/quanteda/quanteda/issues/1329
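As a rough illustration of why making the feature sets conform is harmless for the prediction itself, here is a toy multinomial Naive Bayes scorer in base R. The probabilities are made-up numbers and `nb_log_scores` is a hypothetical function, not quanteda's code. In this formulation a document's score is the log prior plus the sum of count times log P(word | class), so a zero-count column adds nothing, while a word that never occurred in training has no estimated probability and can only be left out.

```r
# Toy multinomial Naive Bayes parameters (made-up numbers, for illustration).
# Rows are classes, columns are the features seen during training.
log_p_word <- log(rbind(
  "0" = c(test = 0.5, signific = 0.3, pronoun = 0.2),
  "1" = c(test = 0.1, signific = 0.1, pronoun = 0.8)
))
log_prior <- log(c("0" = 0.5, "1" = 0.5))

# Hypothetical scorer: log prior + sum over features of
# count * log P(word | class). The matrix product is only meaningful when
# the columns of `counts` line up with the columns of `log_p_word`.
nb_log_scores <- function(counts, log_p_word, log_prior) {
  stopifnot(identical(colnames(counts), colnames(log_p_word)))
  sweep(counts %*% t(log_p_word), 2, log_prior, `+`)
}

# A test document already restricted to the training features: an unseen
# word has been dropped and "pronoun" is present with a count of 0.
doc <- matrix(c(2, 1, 0), nrow = 1,
              dimnames = list("text1", c("test", "signific", "pronoun")))

scores <- nb_log_scores(doc, log_p_word, log_prior)
predicted <- colnames(scores)[max.col(scores)]
print(scores)
print(predicted)
```

Read this way, the feature-set check looks like a safeguard in the implementation rather than something the algorithm itself requires, though that is an interpretation of the sketch and not a statement from the package authors.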