Question

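I am trying to randomly sample 7 numbers from 0 to 7 (with replacement), subject to the constraint that the chosen numbers sum to 7. For example, the output 0 1 1 2 3 0 0 is fine, but the output 1 2 3 4 5 6 7 is not. Is there a way to use the sample command with an added constraint?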

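I intend to use the replicate() function with the sample command as an argument, so that I get back a list of N different vectors. The way I am currently using the sample command (without any constraint), I need N to be very large in order to get as many vectors as possible that sum to exactly 7. I figure there must be an easier way to do this!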

Here is my code for that part:

x <- replicate(100000, sample(0:7, 7, replace=T))

Ideally, I would like 10,000 or 100,000 vectors in x that sum to 7, but that would take an enormous value of N. Thanks for your help.
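To make the cost of the unconstrained approach concrete, here is a minimal rejection-sampling sketch: draw as in the question, then keep only the columns that sum to 7. The variable names (n_draws, keep, accept_rate) are illustrative, and the acceptance rate assumes the count of 1716 valid vectors out of 8^7 possible ones.

```r
# Rejection sampling sketch: draw unconstrained vectors, keep those summing to 7.
set.seed(1)
n_draws <- 100000                                        # hypothetical batch size
x <- replicate(n_draws, sample(0:7, 7, replace = TRUE))  # 7 x n_draws matrix
keep <- x[, colSums(x) == 7, drop = FALSE]               # accepted columns only

# Theoretical acceptance rate: valid vectors / all possible vectors
accept_rate <- choose(13, 6) / 8^7                       # 1716 / 2097152
accept_rate                                              # roughly 0.0008
ncol(keep) / n_draws                                     # empirical rate, should be close
```

With fewer than one draw in a thousand accepted, getting 100,000 valid vectors this way would need on the order of 10^8 draws, which is why the answers look for something more direct.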

Answer 1

To make sure you are sampling uniformly, you can generate all permutations and restrict them to those that sum to 7:

library(gtools)
perms <- permutations(8, 7, 0:7, repeats.allowed=T)
perms7 <- perms[rowSums(perms) == 7,]

From nrow(perms7), we see that there are only 1716 possible permutations that sum to 7. Now you can sample uniformly from those permutations:
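The count of 1716 can be cross-checked without gtools. This is a sketch based on the standard stars-and-bars argument, which is not part of the original answer: a vector of 7 non-negative integers summing to 7 corresponds to a choice of 6 bar positions among 13 slots, so there are choose(13, 6) such vectors, and combn() can enumerate them directly. The names n_valid, bars and comps are illustrative.

```r
n <- 7  # number of elements in each vector
s <- 7  # required sum
n_valid <- choose(s + n - 1, n - 1)   # number of valid vectors
n_valid                               # 1716

# Each column of `bars` is one choice of 6 bar positions among 13 slots.
# The gaps between consecutive bars, minus 1, are the 7 vector entries.
bars <- combn(s + n - 1, n - 1)
comps <- t(apply(bars, 2, function(b) diff(c(0, b, s + n)) - 1))
dim(comps)                  # 1716 rows, 7 columns
all(rowSums(comps) == s)    # every row sums to 7
```

Because no entry can exceed the total of 7, the upper bound of the 0:7 range never binds here; with a larger total this shortcut would no longer match the constrained problem.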

set.seed(144)
my.perms <- perms7[sample(nrow(perms7), 100000, replace=T),]
head(my.perms)
#      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,]    0    0    0    2    5    0    0
# [2,]    1    3    0    1    2    0    0
# [3,]    1    4    1    1    0    0    0
# [4,]    1    0    0    3    0    3    0
# [5,]    0    2    0    0    0    5    0
# [6,]    1    1    2    0    0    2    1

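One advantage of this approach is that it is easy to see that we are sampling uniformly at random. It is also quite fast: building perms7 took 0.3 seconds on my computer, and building a 1-million-row my.perms took 0.04 seconds. If you need to draw many vectors, this will be much faster than a recursive approach, because you are just using matrix indexing into perms7 rather than generating each vector separately.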

Here is the distribution of counts of the numbers in the sample:

#      0      1      2      3      4      5      6      7 
# 323347 188162 102812  51344  22811   8629   2472    423
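The counts above can be compared with what uniform sampling predicts. As a sketch (the derivation is an addition, not from the original answer): if one position is fixed at value k, the other 6 positions must sum to 7 - k, which can happen in choose((7 - k) + 5, 5) ways, so the marginal probability of k is that number divided by 1716.

```r
k <- 0:7
# Number of valid vectors in which a given position equals k:
# the remaining 6 positions must sum to 7 - k.
n_with_k <- choose((7 - k) + 5, 5)
p_k <- n_with_k / choose(13, 6)        # marginal probability of each value
expected <- round(700000 * p_k)        # 100,000 vectors x 7 entries each
setNames(expected, k)
```

This gives about 323077 zeros and about 408 sevens, in line with the observed 323347 and 423.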

Answer 2

Start from all zeros and add one to a randomly chosen element, 7 times over:

sumTo = function(){
    v = rep(0,7)
    for(i in 1:7){
        addTo=sample(7)[1]
        v[addTo]=v[addTo]+1
    }
    v
}

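Or equivalently, choose which of the 7 elements to increment in a single sample of length 7, then tabulate the result, making sure to tabulate up to 7 bins: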

sumTo = function(){tabulate(sample(7, 7, replace = TRUE), 7)}


> sumTo()
[1] 2 1 0 0 4 0 0
> sumTo()
[1] 1 3 1 0 1 0 1
> sumTo()
[1] 1 1 0 2 1 0 2

I don't know whether this produces a uniform sample from all possible combinations...

The distribution of the individual elements over 100,000 replicates is:

> X = replicate(100000,sumTo())
> table(X)
X
     0      1      2      3      4      5      6 
237709 277926 138810  38465   6427    627     36

No vector like 0,0,0,0,0,0,7 came up that time!
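On the uniformity question raised above: tabulating 7 independent uniform picks is a multinomial draw with size 7 and equal cell probabilities, so it is not uniform over the 1716 valid vectors. A short sketch using base R's dmultinom() compares the probability of two specific vectors with the uniform target; the variable names are illustrative.

```r
p <- rep(1/7, 7)   # each slot is equally likely for every increment

# Probability of putting all 7 increments in the first slot
p_all_in_one <- dmultinom(c(7, 0, 0, 0, 0, 0, 0), prob = p)   # (1/7)^7
# Probability of exactly one increment per slot
p_all_ones <- dmultinom(rep(1, 7), prob = p)                  # 7! / 7^7
# Probability of any single vector under uniform sampling
p_uniform <- 1 / choose(13, 6)

c(all_in_one = p_all_in_one, all_ones = p_all_ones, uniform = p_uniform)
```

The balanced vector is thousands of times more likely than the concentrated one under this scheme, which also explains why no 7 appeared in 100,000 replicates.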

Answer 3

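This recursive algorithm outputs a distribution that gives large numbers a higher probability than the other solutions do. The idea is to put a random number y from 0:7 into any one of the seven available slots, then repeat with a random number from 0:(7-y), and so on: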

sample.sum <- function(x = 0:7, n = 7L, s = 7L) {
   if (n == 1) return(s)
   x <- x[x <= s]
   y <- sample(x, 1)
   sample(c(y, Recall(x, n - 1L, s - y)))
}

set.seed(123L)
sample.sum()
# [1] 0 4 0 2 0 0 1

Drawing 100,000 vectors took 11 seconds on my machine, and this is the distribution I got:

#      0      1      2      3      4      5      6      7 
# 441607  98359  50587  33364  25055  20257  16527  14244
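To see the skew toward large values on a smaller run, the function above can be replicated and tabulated. This sketch repeats the definition so that it runs on its own; the seed and the run size of 2000 are arbitrary choices.

```r
# Same recursive sampler as above, repeated here so the snippet is self-contained
sample.sum <- function(x = 0:7, n = 7L, s = 7L) {
   if (n == 1) return(s)          # last slot takes whatever is left
   x <- x[x <= s]                 # candidates cannot exceed the remaining sum
   y <- sample(x, 1)
   sample(c(y, Recall(x, n - 1L, s - y)))   # shuffle so no slot is favoured
}

set.seed(1)
draws <- replicate(2000, sample.sum())    # 7 x 2000 matrix
all(colSums(draws) == 7)                  # every vector satisfies the constraint
shares <- tabulate(draws + 1, 8) / length(draws)
setNames(round(shares, 4), 0:7)           # share of each value 0..7
```

Under uniform sampling the share of sevens would be 1/1716, about 0.0006, whereas the counts above put it near 0.02.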

Answer 4

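There may be an easier and/or more elegant way, but here is a brute-force approach using the LSPM:::.nPri function. The link includes the definition of an R-only version of the algorithm for those who are interested.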

#install.packages("LSPM", repos="http://r-forge.r-project.org")
library(LSPM)
# generate all possible permutations, since there are only ~2.1e6 of them
# (this takes < 40s on my 2.2Ghz laptop)
x <- lapply(seq_len(8^7), nPri, n=8, r=7, replace=TRUE)
# set each permutation that doesn't sum to 7 to NULL
y <- lapply(x, function(p) if(sum(p-1) != 7) NULL else p-1)
# subset all non-NULL permutations
z <- y[which(!sapply(y, is.null))]

Now you can sample from z and be sure that you get a permutation that sums to 7.
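The sampling step itself might look like the sketch below. So that it runs without LSPM, z is replaced here by a tiny hypothetical stand-in list of three valid vectors; with the real z the same two lines would apply.

```r
# Hypothetical stand-in for the list `z` built above (only 3 of the valid vectors)
z <- list(c(7, 0, 0, 0, 0, 0, 0),
          c(1, 1, 1, 1, 1, 1, 1),
          c(0, 1, 1, 2, 3, 0, 0))

set.seed(42)
idx <- sample(length(z), 10, replace = TRUE)   # uniform over the list entries
draws <- do.call(rbind, z[idx])                # one sampled vector per row
rowSums(draws)                                 # all equal to 7
```

Sampling indices rather than the list itself makes it easy to stack the result into a matrix.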

Answer 5

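I found this question interesting and gave it some extra thought. Another (more general) way to sample (approximately) uniformly from all feasible solutions, without generating and storing all permutations (which is clearly impossible with many more than 7 numbers), and still using sample() in R, is a simple MCMC implementation: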

S <- c(0, 1, 1, 2, 3, 0, 0) #initial solution
N <- 100 #number of dependent samples (or burn in period)
series <- numeric(N)
for(i in 1:N){
    b <- sample(1:length(S), 2, replace=FALSE) #pick 2 elements at random
    opt <- sum(S[-b]) #sum of complementary elements
    a <- sample(0:(7-opt), 1) #sample a substitute
    S[b[1]] <- a #change elements
    S[b[2]] <- 7 - opt - a 
}
S #new sample

This is of course very fast for a small number of samples. The "distribution":

#"distribution" N=100.000:      0      1      2      3      4      5      6      7
#                            321729 189647 103206  52129  22287   8038   2532    432
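To draw many vectors with this chain, the loop can be wrapped in a function that records the state every so many steps. This is only a sketch: the function name mcmc_sum7 and the thin argument are illustrative, and treating states 100 steps apart as roughly independent is an assumption carried over from the N used above.

```r
mcmc_sum7 <- function(n_samples, thin = 100, total = 7, len = 7) {
  S <- c(total, rep(0, len - 1))             # any feasible starting vector
  out <- matrix(NA_real_, n_samples, len)
  for (k in seq_len(n_samples)) {
    for (i in seq_len(thin)) {
      b <- sample(len, 2)                    # pick two distinct positions
      pair_sum <- S[b[1]] + S[b[2]]          # this sum must be preserved
      a <- sample.int(pair_sum + 1, 1) - 1   # uniform on 0..pair_sum
      S[b[1]] <- a
      S[b[2]] <- pair_sum - a
    }
    out[k, ] <- S                            # record the state after `thin` moves
  }
  out
}

set.seed(1)
draws <- mcmc_sum7(200)
all(rowSums(draws) == 7)    # the constraint holds for every recorded state
```

Using sample.int(pair_sum + 1, 1) - 1 avoids relying on how sample() treats a length-one vector when the pair sum is 0 or 1.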

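Of course, in this case it is actually possible to find and store all combinations. If you want a large sample from all feasible outcomes, just use partitions::compositions(7, 7), as Josh O'Brien also suggested in the comments, which avoids computing all permutations when only a small fraction of them is needed: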

perms7 <- partitions::compositions(7, 7)

>tabulate(perms7[, sample(ncol(perms7), 100000, TRUE)]+1, 8)
#"distribution" N=100.000:      0      1      2      3      4      5      6      7
#                            323075 188787 102328  51511  22754   8697   2413    435
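If neither storing the full table nor running a chain is attractive, valid vectors can also be drawn directly. The sketch below is an addition based on the stars-and-bars correspondence, not something from the original answers: choosing 6 bar positions uniformly among 13 slots gives a uniformly chosen vector of 7 non-negative integers summing to 7. The function name r_sum7 is illustrative, and the approach assumes the total does not exceed the largest allowed value, as is the case here.

```r
r_sum7 <- function(n_samples, len = 7, total = 7) {
  slots <- total + len - 1
  t(replicate(n_samples, {
    bars <- sort(sample.int(slots, len - 1))   # positions of the len - 1 bars
    diff(c(0, bars, slots + 1)) - 1            # gap sizes are the vector entries
  }))
}

set.seed(144)
draws <- r_sum7(1000)       # 1000 x 7 matrix, one vector per row
all(rowSums(draws) == 7)
head(draws, 3)
```

Each draw costs one small sample.int() call, so this scales to longer vectors where enumeration is impractical.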

R: sample() command subject to a constraint