Question

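I'm having some annoying problems getting a Naive Bayes classifier to work with a document-term matrix. I'm sure I'm making a very simple mistake, but I can't figure out what it is. My data comes from accounts spreadsheets. I've been asked to figure out which categories (in text format: mostly names of departments or names of budgets) are more likely to spend money on charities and which ones mostly (or only) spend on private companies. It was suggested that I use a Naive Bayes classifier to do this. I have a fair amount of data to train a model and many hundreds of thousands of rows to test the model against. I have prepared the strings by replacing spaces with underscores and "and"/"&" with +, and then treated each category as one term: so 'alcohol and drug addiction' becomes alcohol+drug_addiction.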

Some example rows:

"environment+housing strategy+commissioning third_party_payments supporting_ppl_block_gross_chargeable" -> This row went to a charity
"west_north_west customer+tenancy premises h.r.a._special_maintenance" -> This row went to a private company.

Using this example as a template, I wrote the following function to build my document-term matrix (using tm) for both the training and the test data.

library(tm)
library(e1071) 

getMatrix <- function(chrVect){
    testsource <- VectorSource(chrVect)
    testcorpus <- Corpus(testsource)
    testcorpus <- tm_map(testcorpus,stripWhitespace)
    testcorpus <- tm_map(testcorpus, removeWords,stopwords("english"))
    testmatrix <- t(TermDocumentMatrix(testcorpus))
}

trainmatrix <- getMatrix(traindata$cats)
testmatrix <- getMatrix(testdata$cats)

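So far, so good. The problem comes when I try to a) fit a Naive Bayes model and b) predict from that model. Using the klaR package, I get a zero-probability error, because many of the terms have zero instances in one of the categories, and using a Laplace term does not seem to fix this. Using e1071, the model fits, but when I test it using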

model <- naiveBayes(as.matrix(trainmatrix),as.factor(traindata$Code))
rs<- predict(model, as.matrix(testdata$cats))

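...every item is predicted as the same category, even though the two categories should be roughly equal in size. Something in the model clearly isn't working. Looking at some of the terms in model$tables, I can see that many have high values for private and zero for charity, and vice versa. I have used as.factor for the code.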

output:
rs   1  2
  1  0  0
  2 19 17

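Any ideas about what is going wrong? Do DTM matrices not play nicely with naiveBayes? Have I missed a step in preparing the data? I'm completely out of ideas. I hope this is all clear; if not, I'm happy to clarify. Any suggestions would be much appreciated.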

Answer 1

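I have run into this problem myself. As far as I can tell you have done everything right; it is the naive Bayes implementation in e1071 (and therefore in klaR) that is buggy.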

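But there is a simple and quick workaround that makes the naive Bayes implementation in e1071 work again: change your term columns to categorical variables, i.e. use as.factor. You have already done this for your target variable traindata$Code, but you also have to do it for trainmatrix, and of course for your test data as well.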
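As a minimal sketch of that conversion for both sets (base R only): the two small matrices below are made-up stand-ins for as.matrix() of the real document-term matrices, and toFactorFrame is a hypothetical helper, not part of tm or e1071. It assumes that reducing counts to present/absent is acceptable, and it forces the test data onto the training vocabulary with the same factor levels, which the prediction step relies on.

```r
## Hypothetical 0/1 term matrices standing in for as.matrix() of the real DTMs
train_m <- matrix(c(1, 0,
                    0, 1,
                    1, 1),
                  ncol = 2, byrow = TRUE,
                  dimnames = list(NULL, c("premises", "strategy")))
test_m <- matrix(c(0, 1,
                   1, 0),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(NULL, c("strategy", "grants")))

## Turn a count matrix into a data frame of present/absent factors,
## restricted to `vocabulary`; terms missing from `m` are filled with 0.
toFactorFrame <- function(m, vocabulary) {
    out <- lapply(vocabulary, function(term) {
        counts <- if (term %in% colnames(m)) m[, term] else rep(0, nrow(m))
        factor(as.integer(counts > 0), levels = c(0, 1))
    })
    names(out) <- vocabulary
    as.data.frame(out)
}

vocabulary <- colnames(train_m)          # only terms seen in training
trainfac <- toFactorFrame(train_m, vocabulary)
testfac  <- toFactorFrame(test_m, vocabulary)
str(testfac)
```

Terms that occur only in the test data (here "grants") are dropped, since the model has no table for them.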

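I could not track the bug down with 100% certainty, but it lies in this part of the naive Bayes implementation in e1071 (I should note that klaR is only a wrapper around e1071):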

L <- log(object$apriori) + apply(log(sapply(seq_along(attribs),
            function(v) {
                nd <- ndata[attribs[v]]
                ## nd is now a cell, row i, column attribs[v]
                if (is.na(nd) || nd == 0) {
                    rep(1, length(object$apriori))
                } else {
                    prob <- if (isnumeric[attribs[v]]) {
                        ## we select table for attribute
                        msd <- object$tables[[v]]
                        ## if stddev is eqlt eps, assign threshold
                        msd[, 2][msd[, 2] <= eps] <- threshold
                        dnorm(nd, msd[, 1], msd[, 2])
                    } else {
                        object$tables[[v]][, nd]
                    }
                    prob[prob <= eps] <- threshold
                    prob
                }
            })), 1, sum)

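You can see that there is an if-else condition: if the column is not numeric, naive Bayes is applied the way we would expect. If the column is numeric, and this is where the bug comes in, this naive Bayes automatically assumes a normal distribution. If your term matrix contains only 0s and 1s, dnorm behaves very badly. I assume that the prob values produced by dnorm are so low that they are always replaced by the threshold, so the class with the higher a priori probability will always "win".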
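The effect can be reproduced by hand with base R. This is only a sketch that mirrors the numeric branch of the snippet above on a made-up indicator column; the threshold and eps values are the ones I believe predict.naiveBayes uses by default (0.001 and 0), so treat them as an assumption.

```r
## Hypothetical 0/1 indicator for one term, split by class
charity <- c(1, 1, 1, 1)   # term present in every charity row
private <- c(0, 0, 0, 0)   # term absent from every private row

threshold <- 0.001         # assumed default of predict.naiveBayes
eps <- 0                   # assumed default of predict.naiveBayes

## Per-class mean and sd, which is what the model stores for a numeric column
msd <- rbind(charity = c(mean(charity), sd(charity)),
             private = c(mean(private), sd(private)))
msd[, 2][msd[, 2] <= eps] <- threshold   # an sd of 0 is replaced by threshold

## Gaussian density of observing the term (value 1) under each class
prob <- dnorm(1, msd[, 1], msd[, 2])
print(prob)                # densities, not probabilities

prob[prob <= eps] <- threshold           # a density of 0 is replaced by threshold
print(prob)
```

With a near-zero standard deviation the density is either far above 1 or exactly 0, so what enters the log-sum is no longer a sensible per-term probability.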

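However, if I understand your problem correctly, you do not even need prediction; what you want are the probabilities that tell you which department gives money to whom. In that case all you have to do is take a close look at your model. For every term, the model contains a table of per-class probabilities, and I think those are what you are looking for. Let's do this, together with the workaround mentioned above, on your example: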

## I have changed the vectors slightly
first <- "environment+housing strategy+commissioning third_party_payments supporting_ppl_block_gross_chargeable"
second <- "west_north_west customer+tenancy premises h.r.a._special_maintenance"

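## a factor, so that the class levels are kept (strings are not auto-converted since R 4.0)
categories <- factor(c("charity", "private"))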

library(tm)
library(e1071)

getMatrix <- function(chrVect){
    testsource <- VectorSource(chrVect)
    testcorpus <- Corpus(testsource)
    testcorpus <- tm_map(testcorpus,stripWhitespace)
    testcorpus <- tm_map(testcorpus, removeWords,stopwords("english"))
    ## testmatrix <- t(TermDocumentMatrix(testcorpus))
    ## instead just use DocumentTermMatrix; the assignment is superfluous
    return(DocumentTermMatrix(testcorpus))
}

## since you did not supply some more data, I cannot do anything about these lines
## trainmatrix <- getMatrix(traindata$cats)
## testmatrix <- getMatrix(testdata$cats)
## instead only
trainmatrix <- getMatrix(c(first, second))

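## I prefer building a data frame over using as.matrix alone, as I can add the categories more easily
## (as.matrix() is used here because inspect() is meant for printing and may not return the full matrix in newer tm versions)
traindf <- data.frame(categories, as.data.frame(as.matrix(trainmatrix)))

## now turn every term column into a factor, since numeric columns are what triggers the dnorm branch
for (cols in names(traindf[-1])) traindf[[cols]] <- factor(traindf[[cols]])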
## traindf <- apply(traindf, 2, as.factor) did not result in factors

## check if it's as we wished
str(traindf)

## it is
## let's create a model  (with formula syntax)
model <- naiveBayes(categories~., data=traindf)

## if you look at the output (doubled to see it more clearly)
predict(model, newdata=rbind(traindf[-1], traindf[-1]))

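But as I said, you do not need prediction. Looking at the model is enough: for example, model$tables$premises gives you the per-class probabilities for the term premises, and in this toy data it appears in 100% of the rows that went to private companies and in none that went to charities.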

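If you are dealing with a very large dataset, you should specify threshold and eps when calling predict. eps defines the limit at or below which a probability is replaced by threshold. For example, eps = 0 and threshold = 0.000001 could be used.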
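Here is a small sketch of both steps, looking at a term's table and predicting with explicit threshold and eps. It assumes e1071 is installed; the data frame and its column names (code, premises, strategy) are invented for illustration and are not your data.

```r
library(e1071)

## Hypothetical training data: one factor column per term (1 = present, 0 = absent)
train <- data.frame(
    code     = factor(c("charity", "charity", "private", "private")),
    premises = factor(c(0, 0, 1, 1), levels = c(0, 1)),
    strategy = factor(c(1, 0, 0, 0), levels = c(0, 1))
)

model <- naiveBayes(code ~ ., data = train)

## Per-class probabilities of each level of the term "premises"
print(model$tables$premises)

## threshold and eps are arguments of predict(), not of naiveBayes():
## probabilities <= eps are replaced by threshold before the logs are summed
newdata <- train[-1]
pred <- predict(model, newdata, threshold = 1e-6, eps = 0)
raw  <- predict(model, newdata, type = "raw", threshold = 1e-6, eps = 0)
print(pred)
print(round(raw, 4))
```

A smaller threshold punishes a term that was never seen with a class more heavily, which matters more as the number of terms grows.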

Furthermore, you should stick to term-frequency weighting. tf*idf, for example, will not work because of the dnorm in naive Bayes.

Hope I can finally reach 50 reputation :P

Document-term matrix for a Naive Bayes classifier: unexpected results in R
