Question

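I'm looking for a more efficient way to sample multiple times from a list of integers 1:n, where the probability vector (also of length n) is different each time. For 20 trials with n = 10, I know it can be done like this: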

probs <- matrix(runif(200), nrow = 20)
answers <- numeric(20)
for(i in 1:20) answers[i] <- sample(10,1,prob=probs[i,])

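However, this calls sample() separately for every trial just to get a single number each time, so it is probably not the fastest way. Speed would help, because the code will be doing this many times.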

Many thanks!

Luke

Edit: Many thanks to Roman, whose idea of benchmarking helped me find a good solution. I have now moved it to an answer.

Answer 1

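Just for fun, I tried two more versions. On what scale are you doing this sampling? I think all of these are very fast and more or less equivalent (I haven't included the creation of probs for your solution). I would love to see other people's take on this.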

library(rbenchmark)
benchmark(replications = 1000,
          luke = for(i in 1:20) answers[i] <- sample(10,1,prob=probs[i,]),
          roman = apply(probs, MARGIN = 1, FUN = function(x) sample(10, 1, prob = x)),
          roman2 = replicate(20, sample(10, 1, prob = runif(10))))

    test replications elapsed relative user.self sys.self user.child sys.child
1   luke         1000    0.41    1.000      0.42        0         NA        NA
2  roman         1000    0.47    1.146      0.46        0         NA        NA
3 roman2         1000    0.47    1.146      0.44        0         NA        NA

Answer 2

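Here is another approach I found. It is fast, but not as fast as simply calling sample() many times in a for loop. I initially thought it was very good, but I was using benchmark() incorrectly.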

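luke2 <- function(probs) {  # takes a matrix of probability vectors, each in its own row
    probs <- probs / rowSums(probs)        # normalise each row so it sums to 1
    probs <- t(apply(probs, 1, cumsum))    # row-wise cumulative probabilities
    # one uniform draw per row (recycled down the columns); count the thresholds below it
    answer <- rowSums(probs - runif(nrow(probs)) < 0) + 1
    return(answer)
}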

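Here is how it works: picture the probabilities as line segments of various lengths laid end to end along a number line from 0 to 1. High-probability segments take up more of the number line than low-probability ones. You can then choose an outcome by picking a random point on the number line: the segments with high probability are more likely to be hit. The advantage of this approach is that you can roll all the random numbers you need in a single call to runif(), instead of calling sample() over and over as luke, roman and roman2 do. However, it looks like the extra data processing slows it down, and that cost more than offsets the benefit.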
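To make the number-line picture concrete, here is a minimal sketch for a single probability vector. The values in p and the helper name pick are made up for illustration and are not part of the original answer.

```r
# Hypothetical three-outcome example (numbers chosen only for illustration).
p <- c(0.2, 0.5, 0.3)
cum <- cumsum(p / sum(p))   # right-hand ends of the segments: 0.2, 0.7, 1.0

# Return the index of the segment that contains the point u:
# count how many segment ends lie below u, then add 1.
pick <- function(u, cum) sum(cum < u) + 1

pick(0.10, cum)   # 0.10 falls in the first segment
pick(0.65, cum)   # 0.65 falls in the second segment
pick(0.95, cum)   # 0.95 falls in the third segment

# Rough empirical check: with many uniform draws, the observed
# frequencies should come out close to p.
set.seed(1)
draws <- vapply(runif(1e4), pick, numeric(1), cum = cum)
table(draws) / length(draws)
```

luke2 does the same thing for every row at once: each row gets its own cumulative sums and its own uniform draw.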

library(rbenchmark)
probs <- matrix(runif(2000), ncol = 10)
answers <- numeric(200)

benchmark(replications = 1000,
          luke = for(i in 1:20) answers[i] <- sample(10,1,prob=probs[i,]),
          luke2 = luke2(probs),
          roman = apply(probs, MARGIN = 1, FUN = function(x) sample(10, 1, prob = x)),
          roman2 = replicate(20, sample(10, 1, prob = runif(10))))

    test replications elapsed relative user.self sys.self user.child sys.child
1   luke         1000   0.171    1.000     0.166    0.005          0         0
2  luke2         1000   0.529    3.094     0.518    0.012          0         0
3  roman         1000   1.564    9.146     1.513    0.052          0         0
4 roman2         1000   0.225    1.316     0.213    0.012          0         0

For some reason, apply() does very badly as you add more rows. I don't understand why, since I thought it was a wrapper around for(), so roman should perform similarly to luke.
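One possible contributor to the gap, going by the benchmark code above: probs has 200 rows there, so luke2 and roman make 200 draws per replication, while the luke loop and roman2 are hard-coded to 20. A sketch of a like-for-like comparison, in which every candidate makes one draw per row of probs, might look like the following. The wrapper names luke_all and roman_all are hypothetical, luke2 is repeated only so the block runs on its own, and base system.time() stands in for rbenchmark to avoid a package dependency. roman2 is left out because it generates its own probabilities rather than using probs. No timings are shown, since they depend on the machine.

```r
set.seed(42)
probs <- matrix(runif(2000), ncol = 10)   # 200 probability vectors, one per row

# for-loop version, sized from probs instead of a hard-coded 20
luke_all <- function(probs) {
    answers <- numeric(nrow(probs))
    for (i in seq_len(nrow(probs))) {
        answers[i] <- sample(ncol(probs), 1, prob = probs[i, ])
    }
    answers
}

# apply() version: one sample() call per row
roman_all <- function(probs) {
    apply(probs, MARGIN = 1, FUN = function(x) sample(length(x), 1, prob = x))
}

# cumulative-sum version from the answer, repeated so this block is self-contained
luke2 <- function(probs) {
    probs <- probs / rowSums(probs)
    probs <- t(apply(probs, 1, cumsum))
    rowSums(probs - runif(nrow(probs)) < 0) + 1
}

# Time the same amount of work (200 draws per call) for each candidate.
reps <- 100
timings <- c(
    luke_all  = system.time(for (r in seq_len(reps)) luke_all(probs))[["elapsed"]],
    luke2     = system.time(for (r in seq_len(reps)) luke2(probs))[["elapsed"]],
    roman_all = system.time(for (r in seq_len(reps)) roman_all(probs))[["elapsed"]]
)
timings
```

With the row counts matched, any remaining difference between the loop and apply() reflects the functions themselves rather than the amount of work each was given.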

Efficient way to sample from different probability vectors

2 answers: