Question

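I am trying to write a for loop to determine the number of schools whose room cost (column 34) is higher than their board cost (column 23).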

numrows <- dim(schools)[1]
for(ii in 1:numrows){ 
  if(schools[ii, 34] > schools[ii, 23], na.rm = TRUE){
    nrow(numrows)
  }
}

I get the following error

Error in if (schools[ii, 34] > schools[ii, 23]) { : 
  missing value where TRUE/FALSE needed

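I did notice that some of the board costs are missing, and I would like to ignore those in the comparison. Also, I only want the number of rows that satisfy the condition.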

Answer 1

To further illustrate my point, here is a simple example based on a sample data.frame with 10,000 rows

set.seed(2018)
df <- data.frame(one = runif(10^4), two = runif(10^4))

Running a microbenchmark analysis

library(microbenchmark)
res <- microbenchmark(
    vectorised = sum(df[, 1] > df[, 2]),
    for_loop = {
        ss <- 0
        for (i in seq_len(nrow(df))) if (df[i, 1] > df[i, 2]) ss <- ss + 1
        ss
    })

res
#    Unit: microseconds
#       expr        min        lq         mean      median          uq
# vectorised     59.681     65.13     78.33118     72.8305     77.9195
#   for_loop 346250.957 359535.08 398508.54996 379421.2305 426452.4265
#        max neval
#    152.172   100
# 608490.869   100

library(ggplot2)
autoplot(res)

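Note the nearly four orders of magnitude (!!!) difference between the for loop and the vectorised operation (the medians differ by a factor of roughly 5,000). This is neither surprising nor interesting.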

Answer 2

The error

Error in if (schools[ii, 34] > schools[ii, 23]) { : 
  missing value where TRUE/FALSE needed

occurs when either or both of the values in the comparison are NA, because NA propagates through the comparison x > y, e.g.,

> test = 1 > NA
> test
[1] NA

and the flow control if (test) {} cannot determine whether test is TRUE (so the code should be executed) or FALSE

> if (test) {}
Error in if (test) { : missing value where TRUE/FALSE needed

The simple vectorised solution is not good enough

> set.seed(123)
> n = 10; x = sample(n); y = sample(n); y[5] = NA
> sum(x > y)
[1] NA

although the 'fix' is obvious and cheap

> sum(x > y, na.rm = TRUE)
[1] 3
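Applied to a data frame like the one in the question, the same idea might look like the sketch below. The toy schools data frame and its column names room and board are made up for illustration; the real data would use columns 34 and 23 instead.

```r
# Hypothetical stand-in for the asker's data: two cost columns, some values missing.
schools <- data.frame(
    room  = c(5000, 4200, NA, 6100, 3900),
    board = c(4500, 4800, 3000, NA, 3500)
)

# Compare whole columns at once. Rows with an NA on either side give NA,
# and na.rm = TRUE drops those rows from the count.
n_higher <- sum(schools$room > schools$board, na.rm = TRUE)
n_higher
```

With the real data, schools[[34]] > schools[[23]] would take the place of the named columns.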

The for loop fails too, but it is not possible (as was attempted in the original question) to simply add a na.rm = TRUE clause to the if statement

s = 0
for (i in seq_along(x)) {
    if (x[i] > y[i], na.rm = TRUE)
        s <- s + 1
}
s

because this is syntactically invalid

Error: unexpected ',' in:
"for (i in seq_along(x)) {
    if (x[i] > y[i],"

so a more creative solution is needed, e.g., testing whether the value of the comparison is actually TRUE

s <- 0
for (i in seq_along(x)) {
    if (isTRUE(x[i] > y[i]))
        s <- s + 1
}
s
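To see why isTRUE() works here: it returns TRUE only for a single, non-missing TRUE, so an NA comparison is simply treated like FALSE. A small sketch of that behaviour, together with an equivalent explicit NA check (the vectors u and v and the counter s_explicit are illustrative names, not part of the original example):

```r
# isTRUE() is TRUE only for a length-one, non-missing TRUE value.
isTRUE(TRUE)   # TRUE
isTRUE(FALSE)  # FALSE
isTRUE(NA)     # FALSE

# An equivalent, more explicit guard: skip the pair when the comparison is NA.
u <- c(3, 1, 4, 1, 5)
v <- c(2, 7, NA, 1, 3)
s_explicit <- 0
for (i in seq_along(u)) {
    cmp <- u[i] > v[i]
    if (!is.na(cmp) && cmp)   # && short-circuits, so cmp is only tested when not NA
        s_explicit <- s_explicit + 1
}
s_explicit
```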

Of course, it is not useful to compare the performance of incorrect code. A correct solution needs to be written first

f1 <- function(x, y)
    sum(x > y, na.rm = TRUE)
f2 <- function(x, y) {
    s <- 0
    for (i in seq_along(x))
        if (isTRUE(x[i] > y[i]))
            s <- s + 1
    s
}

f1() seems more compact and readable than f2(), but we need to make sure that the results are sensible

> x > y
 [1] FALSE  TRUE FALSE FALSE    NA  TRUE FALSE FALSE FALSE  TRUE
> f1(x, y)
[1] 3

and that both functions return the same result

> identical(f1(x, y), f2(x, y))
[1] FALSE

Hey, what's going on? Don't they look the same?

> f2(x, y)
[1] 3

Actually, the results are numerically equal, but f1() returns an integer value, whereas f2() returns a numeric (double) value

> all.equal(f1(x, y), f2(x, y))
[1] TRUE
> class(f1(x, y))
[1] "integer"
> class(f2(x, y))
[1] "numeric"

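If we're going to compare performance, we really need the results to be identical; there is no point comparing apples and oranges. We can update f2() to return an integer by ensuring that the sum s is always an integer. Use the suffix L (e.g., 0L) to create an integer value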

> class(0)
[1] "numeric"
> class(0L)
[1] "integer"

and make sure that the integer 1L is added to s on each successful iteration

f2a <- function(x, y) {
    s <- 0L
    for (i in seq_along(x))
        if (isTRUE(x[i] > y[i]))
            s <- s + 1L
    s
}

We then have

> f2a(x, y)
[1] 3
> identical(f1(x, y), f2a(x, y))
[1] TRUE
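As a side note, both L suffixes matter. The following sketch, using only base R, shows that adding a plain 1 to an integer promotes the result to double, which is exactly what made f2() differ from f1():

```r
typeof(0L + 1L)                    # "integer"
typeof(0L + 1)                     # "double": mixing integer and double promotes to double
typeof(sum(c(TRUE, FALSE, TRUE)))  # "integer": sum() of a logical vector is an integer
identical(2L, 2)                   # FALSE, although 2L == 2 is TRUE
```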

Now the performance can be compared

> microbenchmark(f1(x, y), f2a(x, y))
Unit: microseconds
      expr    min      lq     mean median      uq    max neval
  f1(x, y)  1.740  1.8965  2.05500  2.023  2.0975  6.741   100
 f2a(x, y) 17.505 18.2300 18.67314 18.487 18.7440 34.193   100

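Certainly f2a() is much slower, but for a problem of this size, where the unit is microseconds, maybe it doesn't matter. How do the solutions scale with problem size?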

> set.seed(123)
> x = lapply(10^(3:7), sample)
> y = lapply(10^(3:7), sample)
> f = f1; microbenchmark(f(x[[1]], y[[1]]), f(x[[2]], y[[2]]), f(x[[3]], y[[3]]))
Unit: microseconds
              expr     min      lq      mean   median       uq      max neval
 f(x[[1]], y[[1]])   9.655   9.976  10.63951  10.3250  11.1695   17.098   100
 f(x[[2]], y[[2]])  76.722  78.239  80.24091  78.9345  79.7495  125.589   100
 f(x[[3]], y[[3]]) 764.034 895.075 914.83722 908.4700 922.9735 1106.027   100
> f = f2a; microbenchmark(f(x[[1]], y[[1]]), f(x[[2]], y[[2]]), f(x[[3]], y[[3]]))
Unit: milliseconds
              expr        min         lq       mean     median         uq
 f(x[[1]], y[[1]])   1.260307   1.296196   1.417762   1.338847   1.393495
 f(x[[2]], y[[2]])  12.686183  13.167982  14.067785  13.923531  14.666305
 f(x[[3]], y[[3]]) 133.639508 138.845753 144.152542 143.349102 146.913338
        max neval
   3.345009   100
  17.713220   100
 165.990545   100

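They both scale linearly (not surprisingly), but even at length 100000 f2a() is not too bad, at roughly 0.14 seconds per call, and it might be a candidate in situations where vectorisation obscures rather than clarifies the code. The cost of extracting individual elements from the columns of a data.frame changes this calculus, but also points to the value of operating on atomic vectors rather than complicated data structures.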
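The linear scaling can be checked with a little arithmetic on the median timings reported in the two benchmark tables above. This is only a sketch: the numbers are copied from that output, and the variable names are made up for illustration.

```r
# Median timings from the benchmark output above, for lengths 1e3, 1e4 and 1e5.
f1_median_us  <- c(10.3250, 78.9345, 908.4700)        # f1, microseconds
f2a_median_ms <- c(1.338847, 13.923531, 143.349102)   # f2a, milliseconds

# Ratio of each timing to the previous one; the input grows 10x per step,
# so ratios near 10 indicate linear scaling.
f1_ratio  <- f1_median_us[-1] / f1_median_us[-3]
f2a_ratio <- f2a_median_ms[-1] / f2a_median_ms[-3]
f1_ratio
f2a_ratio

# How many times slower the loop is than the vectorised version at each size.
slowdown <- (f2a_median_ms * 1000) / f1_median_us
slowdown
```

The step ratios come out close to 10 for both functions, while the loop stays two orders of magnitude slower at every size.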

For what it's worth, one can think of worse implementations, in particular

f3 <- function(x, y) {
    s <- logical(0)
    for (i in seq_along(x))
        s <- c(s, isTRUE(x[i] > y[i]))
    sum(s)
}

which scales quadratically

> f = f3; microbenchmark(f(x[[1]], y[[1]]), f(x[[2]], y[[2]]), f(x[[3]], y[[3]]), times = 1)
Unit: milliseconds
              expr          min           lq         mean       median
 f(x[[1]], y[[1]])     7.018899     7.018899     7.018899     7.018899
 f(x[[2]], y[[2]])   371.248504   371.248504   371.248504   371.248504
 f(x[[3]], y[[3]]) 42528.280139 42528.280139 42528.280139 42528.280139
           uq          max neval
     7.018899     7.018899     1
   371.248504   371.248504     1
 42528.280139 42528.280139     1

(because c(s, ...) copies all of s in order to add one element to it), and this is a pattern very often found in people's code.
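If the individual results really are needed, the usual remedy is to allocate the result vector once and fill it by index. The function name f3_prealloc and the small test vectors below are hypothetical, a minimal sketch rather than part of the benchmark above:

```r
# Hypothetical variant of f3(): allocate the result once instead of growing it.
f3_prealloc <- function(x, y) {
    s <- logical(length(x))          # full-length vector of FALSE, created up front
    for (i in seq_along(x))
        s[i] <- isTRUE(x[i] > y[i])  # fill by index rather than rebuilding with c()
    sum(s)
}

x_small <- c(3, 1, 4, 1, 5)
y_small <- c(2, 7, NA, 1, 3)
f3_prealloc(x_small, y_small)
```

Because s never has to be rebuilt, the work per iteration no longer depends on how many elements have already been processed.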

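A second common slowdown is the use of complicated data structures (like a data.frame) rather than simple data structures (like atomic vectors), e.g., compare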

f4 <- function(df) {
    s <- 0L
    x <- df[[1]]
    y <- df[[2]]
    for (i in seq_len(nrow(df))) {
        if (isTRUE(x[i] > y[i]))
            s <- s + 1L
    }
    s
}

f5 <- function(df) {
    s <- 0L
    for (i in seq_len(nrow(df))) {
        if (isTRUE(df[i, 1] > df[i, 2]))
            s <- s + 1L
    }
    s
}

with

> df <- Map(data.frame, x, y)
> identical(f1(x[[1]], y[[1]]), f4(df[[1]]))
[1] TRUE
> identical(f1(x[[1]], y[[1]]), f5(df[[1]]))
[1] TRUE
> microbenchmark(f1(x[[1]], y[[1]]), f2a(x[[1]], y[[1]]), f4(df[[1]]), f5(df[[1]]), times = 10)
Unit: microseconds
                expr       min        lq       mean     median        uq
  f1(x[[1]], y[[1]])    10.042    10.324    13.3511    13.4425    14.690
 f2a(x[[1]], y[[1]])  1310.186  1316.869  1480.1526  1344.8795  1386.322
         f4(df[[1]])  1329.307  1336.869  1363.4238  1358.7080  1365.427
         f5(df[[1]]) 37051.756 37106.026 38187.8278 37876.0940 38416.276
       max neval
    20.753    10
  2676.030    10
  1439.402    10
 42292.588    10
