Question

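I am trying to simulate a discrete variable that represents the "true state of the world" (for example "red", "green", or "blue") together with an indicator that describes it somewhat imperfectly.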

r_names <- c("real_R", "real_G", "real_B")

Suppose I have a prior belief about the distribution of the "reality" variable, which I use to sample it.

r_probs <- c(0.3, 0.5, 0.2)
set.seed(100)
reality <- sample(seq_along(r_names), 10000, prob=r_probs, replace = TRUE)

Now suppose I have a conditional probability table that gives the probability of each indicator value given each "reality":

ri_matrix <- matrix(c(0.7, 0.3, 0, 
                      0.2, 0.6, 0.2, 
                      0.05,0.15,0.8), byrow=TRUE,nrow = 3)
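# Row labels are the true colours, column labels the indicator values.
dimnames(ri_matrix) <- list(paste("real", c("Red", "Green", "Blue"), sep = "_"),
                            paste("ind", c("R", "G", "B"), sep = "_"))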

ri_matrix

#>            ind_R ind_G ind_B
#> real_Red    0.70  0.30   0.0
#> real_Green  0.20  0.60   0.2
#> real_Blue   0.05  0.15   0.8

Because base::sample() is not vectorized over its prob argument, I have to do this:

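# sample() accepts a single prob vector, so draw one indicator per element of r
# using the row of rim that corresponds to that element's reality.
sample_cond <- function(r, rim){
  unlist(lapply(r, function(x) 
    sample(seq_len(ncol(rim)), 1, prob = rim[x,], replace = TRUE)))
}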

Now I can sample my "indicator" variable using the conditional probability matrix:

set.seed(200)
indicator <- sample_cond(reality, ri_matrix)

Just to make sure the distribution is as expected:

prop.table(table(reality, indicator), margin = 1)

 #>        indicator
 #> reality          1          2          3
 #>       1 0.70043610 0.29956390 0.00000000
 #>       2 0.19976124 0.59331476 0.20692400
 #>       3 0.04365278 0.14400401 0.81234320

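Is there a better (i.e. more idiomatic and/or more efficient) way to sample a discrete variable conditional on another discrete random variable?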

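Update

Following @Mr.Flick's suggestion, the version below is at least 50 times faster, because it reuses each probability vector once per group instead of repeatedly subsetting the conditional probability matrix.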

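sample_cond_group <- function(r, rim) {
  # Draw all indicators for one reality level at once, reusing its probability row.
  # Map() always returns a list, and indexing rim by level stays correct
  # even when some reality level does not occur in r.
  groups <- split(r, r)
  il <- Map(function(x, level) {
    sample(seq_len(ncol(rim)), length(x), prob = rim[level, ], replace = TRUE)
  }, groups, as.integer(names(groups)))
  unsplit(il, r)
}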
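
To get a feel for the difference on your own machine, here is a rough timing sketch. It redefines compact versions of the element-wise and grouped samplers so that it runs on its own. The use of sample.int() and the names t_each and t_group are my own choices for illustration, and the elapsed times will vary by machine and R version, so treat the numbers as indicative only.

```r
# Conditional probability table from the question (rows: reality, columns: indicator).
ri_matrix <- matrix(c(0.7,  0.3,  0,
                      0.2,  0.6,  0.2,
                      0.05, 0.15, 0.8), byrow = TRUE, nrow = 3)

# Element-wise version: one sampling call per observation.
sample_cond <- function(r, rim) {
  vapply(r, function(x) sample.int(ncol(rim), 1, prob = rim[x, ]), integer(1))
}

# Grouped version: one sampling call per reality level present in r.
sample_cond_group <- function(r, rim) {
  groups <- split(r, r)
  il <- Map(function(x, level) {
    sample.int(ncol(rim), length(x), replace = TRUE, prob = rim[level, ])
  }, groups, as.integer(names(groups)))
  unsplit(il, r)
}

set.seed(100)
reality <- sample.int(3, 10000, replace = TRUE, prob = c(0.3, 0.5, 0.2))

# Elapsed times depend on the machine; no particular ratio is implied here.
t_each  <- system.time(ind_each  <- sample_cond(reality, ri_matrix))["elapsed"]
t_group <- system.time(ind_group <- sample_cond_group(reality, ri_matrix))["elapsed"]
print(c(element_wise = unname(t_each), grouped = unname(t_group)))
```

For more stable numbers you would want to repeat each call several times, since a single system.time() measurement is noisy.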

Answer 1

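You can improve efficiency with a split/combine strategy that draws all the random samples for each group in one call. That might look like this: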

simFun <- function(N, r_probs, ri_matrix) {
  stopifnot(length(r_probs) == nrow(ri_matrix))
  ind <- sample.int(length(r_probs), N, prob = r_probs, replace=TRUE)
  grp <- split(data.frame(ind), ind)
  unsplit(Map(function(data, r) {
    draw <- sample.int(ncol(ri_matrix), nrow(data), replace = TRUE, prob = ri_matrix[r, ])
    data.frame(data, draw)
  }, grp, as.numeric(names(grp))), ind)
}

You can then call it with:

simFun(10000, r_probs, ri_matrix)
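
If you want to avoid splitting altogether, one possible alternative is inverse-CDF sampling: draw one uniform number per observation and compare it with the cumulative probabilities of the matching row. This is a different technique from the split/combine approach above, offered only as a sketch. The helper name sample_cond_cdf is hypothetical, and it assumes the matrix has at least two columns and that each row sums to 1.

```r
# Hypothetical helper (not part of the answer above): inverse-CDF sampling
# of an indicator for every element of r in one vectorized step.
# Assumes rim has at least two columns and that each row sums to 1.
sample_cond_cdf <- function(r, rim) {
  # Row-wise cumulative probabilities. The last column is forced to exactly 1
  # so floating-point error cannot push a draw past the last category.
  cum <- t(apply(rim, 1, cumsum))
  cum[, ncol(cum)] <- 1
  u <- runif(length(r))
  # The drawn category is 1 plus the number of cumulative thresholds u exceeds.
  as.integer(rowSums(u > cum[r, , drop = FALSE]) + 1)
}

ri_matrix <- matrix(c(0.7,  0.3,  0,
                      0.2,  0.6,  0.2,
                      0.05, 0.15, 0.8), byrow = TRUE, nrow = 3)

set.seed(1)
r <- sample.int(3, 10000, replace = TRUE, prob = c(0.3, 0.5, 0.2))
draw <- sample_cond_cdf(r, ri_matrix)

# Row-wise relative frequencies should be close to the rows of ri_matrix.
freq <- prop.table(table(factor(r, levels = 1:3), factor(draw, levels = 1:3)), margin = 1)
print(round(freq, 2))
```

This builds an N-by-k logical matrix, so it trades memory for the absence of an explicit loop. Whether it beats the grouped version will depend on N and on the number of categories.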
