Question

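I am trying to run the naiveBayes classifier from the R package e1071. The problem I am running into is that prediction takes roughly 300 times longer than training.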

I would like to know whether anyone else has observed this behavior and, if so, whether there are any suggestions for improving it.

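The issue appears only in some cases. Below is code that trains and predicts an NB classifier on the Iris dataset. There, training and prediction times are much closer (prediction takes 10x longer rather than 300x longer). The only other trace of this issue that I could find online is one other report, where the answer was to make sure that categorical variables are formatted as factors. I have done this, but still see no improvement.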
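For reference, a minimal sketch of the factor check mentioned above. The toy data frame and its column names are hypothetical and only illustrate the conversion; they are not the data from this question.

```r
# Hypothetical toy data frame; column names are illustrative only
df <- data.frame(
  group = c(1, 2, 2, 1),          # categorical, but stored as numeric
  score = c(0.5, 1.2, 0.3, 2.0)   # genuinely numeric
)

# Convert the columns that should be treated as categorical
categorical_cols <- c("group")
df[categorical_cols] <- lapply(df[categorical_cols], as.factor)

# Report the class of every column to confirm the conversion
col_classes <- sapply(df, class)
print(col_classes)
```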

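I have experimented with the sample size N, and the problem seems to lessen as N decreases. Perhaps this is the intended behavior of the algorithm? Decreasing N by a factor of 10 leaves prediction about 150x slower than training, while increasing N by a factor of 10 yields a similar slowdown of about 300x. These numbers seem crazy to me, especially because I have used this algorithm in the past on datasets with roughly 300,000 examples and found it to be quite fast. Something seems fishy, but I cannot figure out what.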

I am using R version 3.3.1 on Linux. The e1071 package is up to date (2015 release).

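The code below should be reproducible on any machine. For reference, my machine timed the Iris training at 0.003s, the Iris prediction at 0.032s, the simulated-data training at 0.045s, and the resulting prediction at 15.205s. If you get numbers that differ from these, please let me know, since it could be an issue with my local machine.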

# Remove everything from the environment and clear out memory
rm(list = ls())
gc()

# Load required packages and datasets
require(e1071)
data(iris)

# Custom function: tic/toc function to time the execution.
# Timer state is stored in the global environment, because the base
# environment is locked in recent R versions and cannot take new bindings.
tic <- function(gcFirst = TRUE, type=c("elapsed", "user.self", "sys.self"))
{
  type <- match.arg(type)
  assign(".type", type, envir=globalenv())
  if(gcFirst) gc(FALSE)
  tic <- proc.time()[type]
  assign(".tic", tic, envir=globalenv())
  invisible(tic)
}

toc <- function()
{
  type <- get(".type", envir=globalenv())
  toc <- proc.time()[type]
  tic <- get(".tic", envir=globalenv())
  print(toc - tic)
  invisible(toc)
}

# set seed for reproducibility
set.seed(12345)

#---------------------------------
# 1. Naive Bayes on Iris data
#---------------------------------
tic()
model.nb.iris <- naiveBayes(Species~Sepal.Length+Sepal.Width+Petal.Length+Petal.Width,data=iris)
toc()
tic()
pred.nb.iris <- predict(model.nb.iris, iris, type="raw")
toc()

#---------------------------------
# 2. Simulate data and reproduce NB error
#---------------------------------
# Hyperparameters
L <- 5   # no. of locations
N <- 1e4*L

# Data
married        <- 1*(runif(N,0.0,1.0)>.45)
kids           <- 1*(runif(N,0.0,1.0)<.22)
birthloc       <- sample(1:L,N,TRUE)
major          <- 1*(runif(N,0.0,1.0)>.4)
exper          <- 15+4*rnorm(N)
exper[exper<0] <- 0
migShifter     <- 2*runif(N,0.0,1.0)-1
occShifter     <- 2*runif(N,0.0,1.0)-1
X <- data.frame(rep.int(1,N),birthloc,migShifter,occShifter,major,married,kids,exper,exper^2,exper^3)
colnames(X)[1] <- "constant"
rm(married)
rm(kids)
rm(birthloc)
rm(major)
rm(exper)
rm(occShifter)

# Parameters and errors
Gamma <- 15*matrix(runif(7*L), nrow=7, ncol=L)
eps <- matrix(rnorm(N*L, 0, 1), nrow=N, ncol=L)

# Deterministic portion of probabilities
u <- matrix(rep.int(0,N*L), nrow=N, ncol=L)
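for (l in 1:L) {
    # Each continued line must end with `+`; otherwise R treats the
    # following lines as separate expressions and drops them from the sum
    u[ ,l] = (X$birthloc==l)*Gamma[1,l] +
    X$major*Gamma[2,l]         + X$married*Gamma[3,l]              +
    X$kids*Gamma[4,l]          + X$exper*Gamma[5,l]                +
    X$occShifter*Gamma[6,l]    + X$migShifter*X$married*Gamma[7,l] +
    eps[ ,l]
}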

choice <- apply(u, 1, which.max)

# Add choice to data frame
dat <- cbind(choice,X)

# factorize categorical variables for estimation
dat$major      <- as.factor(dat$major)
dat$married    <- as.factor(dat$married)
dat$kids       <- as.factor(dat$kids)
dat$birthloc   <- as.factor(dat$birthloc)
dat$choice     <- as.factor(dat$choice)

tic()
model.nb <- naiveBayes(choice~birthloc+major+married+kids+exper+occShifter+migShifter,data=dat,laplace=3)
toc()
tic()
pred.nb <- predict(model.nb, dat, type="raw")
toc()
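
To see how prediction time scales with N, one option is to time predict() on random subsets of increasing size. The sketch below is only an illustration: the helper name time_predict is hypothetical, and the demonstration uses lm() on simulated data as a stand-in so that it runs without e1071.

```r
# Time predict() on random subsets of increasing size.
# `model` can be any fitted object with a predict() method;
# `data` is the data frame to draw rows from.
time_predict <- function(model, data, sizes, ...) {
  sapply(sizes, function(n) {
    rows <- sample(nrow(data), n)
    subset_data <- data[rows, , drop = FALSE]
    unname(system.time(predict(model, subset_data, ...))["elapsed"])
  })
}

# Stand-in demonstration with lm() (assumption: not the NB model above)
set.seed(1)
demo <- data.frame(x = rnorm(20000))
demo$y <- 2 * demo$x + rnorm(20000)
fit <- lm(y ~ x, data = demo)

sizes <- c(2000, 20000)
timings <- time_predict(fit, demo, sizes)
print(data.frame(n = sizes, elapsed = timings))
```

With the objects from the code above, the equivalent call would be time_predict(model.nb, dat, c(5e3, 5e4), type = "raw"). If the elapsed time grows much faster than the subset size, that points at the prediction code rather than at the machine.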

Answer 1

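I ran into the same problem. I needed to run naive Bayes and predict many times (1000 times) on some large matrices (10,000 rows, 1000-2000 columns). Since I had some time, I decided to write my own implementation of naive Bayes to make it faster: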

https://cran.r-project.org/web/packages/fastNaiveBayes/index.html

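I put some work into this and created a package from it: https://cran.r-project.org/web/packages/fastNaiveBayes/index.html. Using the Bernoulli event model, it is now about 330 times faster. It also implements a multinomial event model (even a bit faster) and a Gaussian model (slightly faster). Finally, there is a mixed model, which makes it possible to use different event models for different columns and combine them!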

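The reason the predict function in e1071 is so slow is that it essentially uses a double for loop. A pull request that vectorizes at least one of these loops has been open since early 2017, but it has not been accepted yet.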
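
To illustrate what vectorizing the prediction step can look like, here is a minimal base-R sketch of Gaussian naive Bayes on the Iris data. It is not the code from e1071, fastNaiveBayes, or the pull request mentioned above; it assumes numeric features only and uses no Laplace smoothing. The only remaining loop is over the classes, with no loop over rows or features.

```r
data(iris)
X <- as.matrix(iris[, 1:4])
y <- iris$Species
classes <- levels(y)

# Per-class means and standard deviations (rows: classes, cols: features)
mu    <- t(sapply(classes, function(k) colMeans(X[y == k, , drop = FALSE])))
sigma <- t(sapply(classes, function(k) apply(X[y == k, , drop = FALSE], 2, sd)))
prior <- as.numeric(table(y)[classes]) / length(y)

# Log-posterior up to an additive constant: one matrix pass per class
log_post <- sapply(seq_along(classes), function(k) {
  z <- sweep(sweep(X, 2, mu[k, ], "-"), 2, sigma[k, ], "/")
  log(prior[k]) - sum(log(sigma[k, ])) - 0.5 * rowSums(z^2)
})
colnames(log_post) <- classes

# Normalize to probabilities (same shape as predict(..., type = "raw"))
post <- exp(log_post - apply(log_post, 1, max))
post <- post / rowSums(post)

# Most probable class per row and training accuracy
pred <- factor(classes[max.col(log_post, ties.method = "first")], levels = classes)
print(mean(pred == y))
```

Subtracting the row-wise maximum before exponentiating keeps the normalization numerically stable when the log-posteriors are very negative.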

e1071 package: naiveBayes prediction is slow