
Question

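I keep seeing comments from data scientists online saying that for loops are not advisable. Recently, however, I found myself in a situation where using one was helpful. I would like to know whether there is a better alternative for the following process (and why the alternative would be better):

I needed to run a series of repeated-measures ANOVAs and approached the problem along the lines of the reproducible example below.

[I am aware that there are other issues with running multiple ANOVA models, and that there are other options for these kinds of analyses, but for now I would simply like to hear about the use of the for loop.]

As an example, here are four repeated-measures ANOVA models, with four dependent variables each measured on three occasions: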

set.seed(1976)
code <- seq(1:60)
time <- rep(c(0,1,2), each = 20)
DV1 <- c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 14, 2))
DV2 <- c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 10, 2))
DV3 <- c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 8, 2))
DV4 <- c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 10, 2))
dat <- data.frame(code, time, DV1, DV2, DV3, DV4)

outANOVA <- list()

for (i in names(dat)) {
  y <- dat[[i]]
  outANOVA[i] <- summary(aov(y ~ factor(time) + Error(factor(code)), 
                                  data = dat))
}

outANOVA

Answer 1

You could write it this way, which is more compact:

outANOVA <-
  lapply(dat,function(y)
    summary(aov(y ~ factor(time) + Error(factor(code)),data = dat)))

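for loops are not necessarily slower than the apply functions, but many people find them harder to read. To some extent it is a matter of taste.

The real crime is to use a for loop when a vectorized function is available. Vectorized functions usually contain for loops written in C (or call functions that do), and those are much faster.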
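As a minimal sketch of that point (the variable names are made up for illustration and are not from the original post), squaring a vector element by element in a loop gives the same values as applying `^` to the whole vector, but the second form pushes the iteration down into compiled code:

```r
n <- 1e5
x <- seq_len(n)

# Explicit R-level loop, writing into a preallocated result vector.
sq_loop <- numeric(n)
for (i in seq_len(n)) {
  sq_loop[i] <- x[i]^2
}

# Vectorised equivalent: `^` operates on the whole vector at once.
sq_vec <- x^2

# Both approaches produce the same values.
all.equal(sq_loop, sq_vec)
```

Wrapping each variant in `system.time()` is a quick way to see the difference on your own machine.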

Note that in this case we also avoid creating a global variable y, and we do not have to initialize the list outANOVA.
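The scoping difference can be seen on a toy example. This is only a sketch: the data frame `scores` and the use of `mean` are hypothetical stand-ins for the ANOVA call. The loop needs a container set up beforehand and leaves `y` in the calling environment, while `lapply` keeps its argument local to the anonymous function and returns a list already named after the columns:

```r
scores <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))  # hypothetical data

# Loop version: the container must exist first, and `y` and `nm`
# remain in the calling environment after the loop finishes.
out_loop <- list()
for (nm in names(scores)) {
  y <- scores[[nm]]
  out_loop[[nm]] <- mean(y)
}

# lapply version: nothing to initialise; `v` exists only inside the function,
# and the result inherits the column names.
out_apply <- lapply(scores, function(v) mean(v))

identical(out_loop, out_apply)
```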

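Another point, taken directly from this related post: For loops in R and computational speed (answer by Glen_b):

For loops in R are not always slower than other approaches, such as apply, but there is one huge bugbear: never grow an array inside a loop.

Instead, make your arrays full-size before you loop and then fill them up.

In your case you are growing outANOVA, which could become a problem for big loops.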
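Applied to the question's loop, preallocation could look roughly like the sketch below. Two things here are assumptions rather than part of the original code: the loop is restricted to the columns whose names start with "DV" (the original iterates over every column, including code and time), and the simulated data only mimics the shape of the question's example.

```r
set.seed(1976)
dat <- data.frame(
  code = 1:60,
  time = rep(c(0, 1, 2), each = 20),
  DV1  = c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 14, 2)),
  DV2  = rnorm(60, 10, 2),
  DV3  = c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 8, 2)),
  DV4  = rnorm(60, 10, 2)
)

# Assumption: only the dependent variables are of interest.
dv_names <- grep("^DV", names(dat), value = TRUE)

# Create the full-size, named list before the loop instead of growing it.
outANOVA <- setNames(vector("list", length(dv_names)), dv_names)

for (nm in dv_names) {
  y <- dat[[nm]]
  # `[[<-` stores the whole summary object in the slot for this variable.
  outANOVA[[nm]] <- summary(aov(y ~ factor(time) + Error(factor(code)),
                                data = dat))
}

names(outANOVA)
```

With only four models the preallocation makes no measurable difference; it starts to matter when the loop runs thousands of times.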

Here is a microbenchmark of different methods on a simple example:

n <- 100000
microbenchmark::microbenchmark(
preallocated_vec  = {x <- vector(length=n); for(i in 1:n) {x[i] <- i^2}},
preallocated_vec2 = {x <- numeric(n); for(i in 1:n) {x[i] <- i^2}},
incremented_vec   = {x <- vector(); for(i in 1:n) {x[i] <- i^2}},
preallocated_list = {x <- vector(mode = "list", length = n); for(i in 1:n) {x[i] <- i^2}},
incremented_list  = {x <- list(); for(i in 1:n) {x[i] <- i^2}},
sapply            = sapply(1:n, function(i) i^2),
lapply            = lapply(1:n, function(i) i^2),
times=20)

# Unit: milliseconds
# expr                     min         lq       mean     median         uq        max neval
# preallocated_vec    9.784237  10.100880  10.686141  10.367717  10.755598  12.839584    20
# preallocated_vec2   9.953877  10.315044  10.979043  10.514266  11.792158  12.789175    20
# incremented_vec    74.511906  79.318298  81.277439  81.640597  83.344403  85.982590    20
# preallocated_list  10.680134  11.197962  12.382082  11.416352  13.528562  18.620355    20
# incremented_list  196.759920 201.418857 212.716685 203.485940 205.441188 393.522857    20
# sapply              6.557739   6.729191   7.244242   7.063643   7.186044   9.098730    20
# lapply              6.019838   6.298750   6.835941   6.571775   6.844650   8.812273    20

Answer 2

For your use case, I would say the point is moot. Applying vectorization (and obfuscating the code in the process) has no benefit here.

Below is an example in which I ran microbenchmark::microbenchmark on your solution as presented in the OP, on Moody's solution as given in his post, and on a third solution of mine with even more vectorization (a triple-nested lapply).

Microbenchmark

set.seed(1976); code = seq(1:60); time = rep(c(0,1,2), each = 20);
DV1 = c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 14, 2))
DV2 = c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 10, 2))
DV3 = c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 8, 2))
DV4 = c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 10, 2))
dat = data.frame(code, time, DV1, DV2, DV3, DV4)

library(microbenchmark)

microbenchmark(
    `Peter Miksza` = {
        outANOVA1 = list()
        for (i in names(dat)) {
            y = dat[[i]]
            outANOVA1[i] = summary(aov(y ~ factor(time) + Error(factor(code)), 
                data = dat))
        }
    },
    Moody_Mudskipper = {
        outANOVA2 =
            lapply(dat,function(y)
                summary(aov(y ~ factor(time) + Error(factor(code)),data = dat)))
    },
    `catastrophic_failure` = {
        outANOVA3 = 
            lapply(lapply(lapply(dat, function(y) y ~ factor(time) + Error(factor(code))), aov, data = dat), summary)
    },
    times = 1000L)

Results

# Unit: milliseconds
#                  expr      min       lq     mean   median       uq       max neval cld
#          Peter Miksza 26.25641 27.63011 31.58110 29.60774 32.81374 136.84448  1000   b
#      Moody_Mudskipper 22.93190 23.86683 27.20893 25.61352 28.61729 135.58811  1000  a
#  catastrophic_failure 22.56987 23.57035 26.59955 25.15516 28.25666  68.87781  1000  a

I also fiddled with JIT compilation by running compiler::enableJIT(0) and compiler::setCompilerOptions(optimize = 0); the results table from that run is not reproduced here.

Conclusion

As Dirk's comment alluded to, there is no difference in performance, but using vectorization greatly impairs readability.

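On growing lists

Experimenting with Moody's solutions, it seems that growing a list can be a bad idea if the resulting list is moderately long. Also, using a byte-compiled function directly can give a small improvement in performance. Both are expected behaviors. Pre-allocation might be sufficient for your application, though.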
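For reference, byte-compiling a function explicitly can be done with `compiler::cmpfun`. The sketch below uses a made-up helper, `square_all`, rather than the ANOVA code, and only checks that the compiled and uncompiled versions agree. Since R 3.4.0 the JIT compiler is enabled by default, so on recent versions the explicit call mostly matters when JIT has been switched off.

```r
library(compiler)

# Hypothetical helper: fill a preallocated vector inside a loop.
square_all <- function(n) {
  x <- numeric(n)
  for (i in seq_len(n)) {
    x[i] <- i^2
  }
  x
}

# Byte-compile the function ahead of time instead of relying on the JIT.
square_all_cmp <- cmpfun(square_all)

# The compiled version returns exactly the same result.
identical(square_all(1000), square_all_cmp(1000))
```

Timing the two versions with `microbenchmark::microbenchmark` after `compiler::enableJIT(0)` is one way to see how much the byte compiler contributes.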

Why not use a for loop?
