Question

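The following example is based on a discussion about using expand.grid with large data. As you can see, it ends in an error. I guess this is because the number of possible combinations is, according to the page mentioned, 68.7 billion (8^12 = 68,719,476,736):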

> v1 <-  c(1:8)
> v2 <-  c(1:8)
> v3 <-  c(1:8)
> v4 <-  c(1:8)
> v5 <-  c(1:8)
> v6 <-  c(1:8)
> v7 <-  c(1:8)
> v8 <-  c(1:8)
> v9 <-  c(1:8)
> v10 <- c(1:8)
> v11 <- c(1:8)
> v12 <- c(1:8)
> expand.grid(v1, v2, v3, v4, v5, v6, v7, v8, v9, v10, v11, v12)
Error in rep.int(rep.int(seq_len(nx), rep.int(rep.fac, nx)), orep) : 
  invalid 'times' value
In addition: Warning message:
In rep.int(rep.int(seq_len(nx), rep.int(rep.fac, nx)), orep) :
  NAs introduced by coercion to integer range
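
The error message is consistent with the row count exceeding R's integer range. A quick check of the arithmetic (a sketch added for illustration, not part of the original session):

```r
# Number of rows expand.grid would need for 12 vectors of length 8
n_rows <- 8^12
n_rows

# Largest value an R integer can hold
.Machine$integer.max

# TRUE: the row count does not fit in an integer
n_rows > .Machine$integer.max

# Even 8 vectors of length 8 already mean 8^8 = 16,777,216 rows
8^8
```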

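Even with 8 vectors it kills my CPU and/or RAM (> expand.grid(v1, v2, v3, v4, v5, v6, v7, v8)). I found some suggested improvements that use outer or rep.int. Those solutions work with two vectors, so I cannot apply them to 12 vectors, but I guess the principle is the same: they create a large matrix that resides in memory. I wonder whether there is something like Python's xrange, which evaluates lazily? I also found the delayedAssign function, but I guess it will not help, because the following is mentioned as well: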

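Unfortunately, R evaluates lazy variables when they are pointed to by a data structure, even if their value is not needed at the time. This means that infinite data structures, one common application of laziness in Haskell, are not possible in R.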
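
The behaviour described in that quote can be observed directly. This is a minimal sketch; the state environment is a hypothetical helper used only to record when the promise is forced:

```r
# A small environment used only to record whether the promise was forced
state <- new.env()
state$forced <- FALSE

# x is a promise: its expression has not been evaluated yet
delayedAssign("x", { state$forced <- TRUE; 42 })
state$forced   # still FALSE at this point

# Storing x in a list evaluates it, although the value is never used
lst <- list(x)
state$forced   # TRUE once the list has been built
```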

Are nested loops the only solution to this problem?

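PS: I don't have a specific problem, but suppose that, for some reason, you need to do some computation with a function that takes 12 integer arguments. Suppose also that you need to run it on every combination of those 12 integers and save the results to a file. Using 12 nested loops and writing the results to the file as you go will work (it will be slow, but it won't kill your RAM). It has been shown elsewhere how expand.grid and the apply functions can replace two nested loops. The problem is that using expand.grid to create such a matrix from 12 vectors of length 8 has some disadvantages: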

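generating such a matrix is slow
such a large matrix consumes a lot of memory (68.7 billion rows and 12 columns)
further iteration over this matrix using apply is slow too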

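So in my view, the functional approach is much slower than the procedural solution. I am just wondering whether it is possible to lazily create a large data structure that in theory does not fit into memory, and to iterate over it. That's all.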

Answer 1

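One (arguably more "proper") way to approach this is to write your own iterator for the iterators package, as @BenBolker suggested (there is a PDF on writing such extensions). Lacking something more formal, here is a poor man's iterator, similar to expand.grid but advanced manually. (Note: this suffices as long as the computation in each iteration is more expensive than this function itself. It could certainly be improved, but "it works".)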

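Each time the returned function is called, it returns a named list (with the provided factors). It is lazy in that it does not expand the entire list of possible combinations; it is not lazy with the arguments themselves, which are "consumed" immediately.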

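lazyExpandGrid <- function(...) {
  dots <- list(...)
  # number of values supplied for each factor
  sizes <- sapply(dots, length, USE.NAMES = FALSE)
  # current position in each factor; starts one step before the first combination
  indices <- c(0, rep(1, length(dots)-1))
  function() {
    # advance the first (fastest-varying) factor
    indices[1] <<- indices[1] + 1
    # carry any overflow into the next factor, like an odometer
    while (any(rolls <- (indices > sizes))) {
      # the last factor overflowed: every combination has been returned
      if (tail(rolls, n=1)) return(FALSE)
      indices[rolls] <<- 1
      indices[ 1+which(rolls) ] <<- indices[ 1+which(rolls) ] + 1
    }
    mapply(`[`, dots, indices, SIMPLIFY = FALSE)
  }
}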

Sample usage:

nxt <- lazyExpandGrid(a=1:3, b=15:16, c=21:22)
nxt()
#   a  b  c
# 1 1 15 21
nxt()
#   a  b  c
# 1 2 15 21
nxt()
#   a  b  c
# 1 3 15 21
nxt()
#   a  b  c
# 1 1 16 21

## <yawn>

nxt()
#   a  b  c
# 1 3 16 22
nxt()
# [1] FALSE

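Note: for brevity of display, I used as.data.frame(mapply(...)) in the examples; it works either way, but if a named list works for you, then the conversion to a data.frame is unnecessary.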

Edit

Based on alexis_laz's answer, here is a much-improved version that (a) is much faster and (b) allows arbitrary seeking.

lazyExpandGrid <- function(...) {
  dots <- list(...)
  argnames <- names(dots)
  if (is.null(argnames)) argnames <- paste0('Var', seq_along(dots))
  sizes <- lengths(dots)
  indices <- cumprod(c(1L, sizes))
  maxcount <- indices[ length(indices) ]
  i <- 0
  function(index) {
    i <<- if (missing(index)) (i + 1L) else index
    if (length(i) > 1L) return(do.call(rbind.data.frame, lapply(i, sys.function(0))))
    if (i > maxcount || i < 1L) return(FALSE)
    setNames(Map(`[[`, dots, (i - 1L) %% indices[-1L] %/% indices[-length(indices)] + 1L ),
             argnames)
  }
}

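It takes no argument (auto-increments the internal counter), one argument (seeks to it and sets the internal counter), or a vector argument (seeks to each element, sets the counter to the last one, and returns a data.frame).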

This last use case allows sampling a subset of the design space:

set.seed(42)
nxt <- lazyExpandGrid(a=1:1e2, b=1:1e2, c=1:1e2, d=1:1e2, e=1:1e2, f=1:1e2)
as.data.frame(nxt())
#   a b c d e f
# 1 1 1 1 1 1 1
nxt(sample(1e2^6, size=7))
#      a  b  c  d  e  f
# 2   69 61  7  7 49 92
# 21  72 28 55 40 62 29
# 3   88 32 53 46 18 65
# 4   88 33 31 89 66 74
# 5   57 75 31 93 70 66
# 6  100 86 79 42 78 46
# 7   55 41 25 73 47 94

Thanks to alexis_laz for the improvements: cumprod, Map, and the index calculation!
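
Returning to the question's goal of saving results to a file without holding the full grid in memory, the same index arithmetic can be applied one chunk at a time. This is only a sketch: grid_rows is a hypothetical helper, the three short vectors stand in for the twelve, and the sum is a placeholder for the real computation.

```r
# Hypothetical helper: rows `idx` of the expand.grid of `dots`, as a data.frame
grid_rows <- function(dots, idx) {
  offs <- cumprod(c(1, lengths(dots)))        # doubles, so no integer overflow
  cols <- lapply(seq_along(dots), function(k) {
    dots[[k]][(idx - 1) %% offs[k + 1] %/% offs[k] + 1]
  })
  names(cols) <- names(dots)
  as.data.frame(cols, stringsAsFactors = FALSE)
}

dots  <- list(a = 1:3, b = 15:16, c = 21:22)  # small stand-in for the 12 vectors
total <- prod(lengths(dots))                  # number of combinations
chunk <- 5                                    # rows held in memory at once
out   <- tempfile(fileext = ".csv")

start <- 1
while (start <= total) {
  idx  <- seq(start, min(start + chunk - 1, total))
  rows <- grid_rows(dots, idx)
  rows$result <- rows$a + rows$b + rows$c     # placeholder for the real computation
  # write the header once, then append the following chunks
  write.table(rows, out, sep = ",", row.names = FALSE,
              col.names = (start == 1), append = (start > 1))
  start <- start + chunk
}
```

Only chunk rows exist in memory at any time, so the chunk size trades memory use against the overhead of repeated writes.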

Answer 2

Another approach that appears to work:

exp_gr = function(..., index)
{
    args = list(...)
    ns = lengths(args)
    offs = cumprod(c(1L, ns))
    n = offs[length(offs)]

    stopifnot(index <= n)

    i = (index[[1L]] - 1L) %% offs[-1L] %/% offs[-length(offs)] 

    return(do.call(data.frame, 
           setNames(Map("[[", args, i + 1L), 
                    paste("Var", seq_along(args), sep = ""))))
}

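In the above function, ... are the arguments to expand.grid and index is the (increasing) number of the combination to return, i.e. the row it would occupy in the expand.grid output. E.g.: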

expand.grid(1:3, 10:12, 21:24, letters[2:5])[c(5, 22, 24, 35, 51, 120, 144), ]
#    Var1 Var2 Var3 Var4
#5      2   11   21    b
#22     1   11   23    b
#24     3   11   23    b
#35     2   12   24    b
#51     3   11   22    c
#120    3   10   22    e
#144    3   12   24    e
do.call(rbind, lapply(c(5, 22, 24, 35, 51, 120, 144), 
                      function(i) exp_gr(1:3, 10:12, 21:24, letters[2:5], index = i)))
#  Var1 Var2 Var3 Var4
#1    2   11   21    b
#2    1   11   23    b
#3    3   11   23    b
#4    2   12   24    b
#5    3   11   22    c
#6    3   10   22    e
#7    3   12   24    e
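
To see why the index arithmetic works, it can help to decompose a single index by hand. The sketch below repeats the calculation from exp_gr for index 22 of the example above; the variable names are chosen here for illustration.

```r
ns    <- c(3, 3, 4, 4)        # lengths of 1:3, 10:12, 21:24, letters[2:5]
offs  <- cumprod(c(1, ns))    # 1 3 9 36 144: rows spanned by each prefix of vectors
index <- 22

# Position of the chosen element within each vector (1-based)
pos <- (index - 1) %% offs[-1] %/% offs[-length(offs)] + 1
pos
```

The positions come out as 1, 2, 3, 1, which select 1, 11, 23 and "b", the same values as row 22 of the expand.grid output above.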

On large structures:

expand.grid(1:1e2, 1:1e2, 1:1e2, 1:1e2, 1:1e2, 1:1e2)
#Error in rep.int(rep.int(seq_len(nx), rep.int(rep.fac, nx)), orep) : 
#  invalid 'times' value
#In addition: Warning message:
#In rep.int(rep.int(seq_len(nx), rep.int(rep.fac, nx)), orep) :
#  NAs introduced by coercion to integer range
exp_gr(1:1e2, 1:1e2, 1:1e2, 1:1e2, 1:1e2, 1:1e2, index = 1)
#  Var1 Var2 Var3 Var4 Var5 Var6
#1    1    1    1    1    1    1
exp_gr(1:1e2, 1:1e2, 1:1e2, 1:1e2, 1:1e2, 1:1e2, index = 1e3 + 487)
#  Var1 Var2 Var3 Var4 Var5 Var6
#1   87   15    1    1    1    1
exp_gr(1:1e2, 1:1e2, 1:1e2, 1:1e2, 1:1e2, 1:1e2, index = 1e2 ^ 6)
#  Var1 Var2 Var3 Var4 Var5 Var6
#1  100  100  100  100  100  100
exp_gr(1:1e2, 1:1e2, 1:1e2, 1:1e2, 1:1e2, 1:1e2, index = 1e11 + 154)
#  Var1 Var2 Var3 Var4 Var5 Var6
#1   54    2    1    1    1   11

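A similar approach would be to build a "class" that stores the ... arguments intended for expand.grid and to define a [ method that computes the appropriate combination for an index when needed. Using %% and %/% seems valid, though I guess iterating with these operators will be slower than it needs to be.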
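
A minimal sketch of that idea, assuming an S3 class; the names lazy_grid and its [ method are hypothetical and not taken from any package:

```r
# Hypothetical constructor: stores the vectors, never the expanded grid
lazy_grid <- function(...) {
  args <- list(...)
  if (is.null(names(args))) names(args) <- paste0("Var", seq_along(args))
  structure(list(args = args, offs = cumprod(c(1, lengths(args)))),
            class = "lazy_grid")
}

# Row extraction: x[i] computes only the requested combinations
`[.lazy_grid` <- function(x, i) {
  n <- x$offs[length(x$offs)]                 # total number of combinations
  stopifnot(all(i >= 1), all(i <= n))
  cols <- lapply(seq_along(x$args), function(k) {
    x$args[[k]][(i - 1) %% x$offs[k + 1] %/% x$offs[k] + 1]
  })
  names(cols) <- names(x$args)
  as.data.frame(cols, stringsAsFactors = FALSE)
}

g <- lazy_grid(1:3, 10:12, 21:24, letters[2:5])
g[c(5, 22, 144)]
```

Because the index arithmetic is vectorised over i here, one call can fetch many rows, which may reduce the per-row overhead of calling the function once per index.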

Python's xrange alternative for R, or how to loop over a large dataset lazily?