Question

If we could first compute all elements of the power set and then draw a sample from them, generating a random subset of the power set would be easy:

set.seed(12)
x = 1:4
n.samples = 3

library(HapEstXXR)
power.set = HapEstXXR::powerset(x)
sample(power.set, size = n.samples, replace = FALSE)
# [[1]]
# [1] 2
# 
# [[2]]
# [1] 3 4
# 
# [[3]]
# [1] 1 3 4

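However, if x is long, the power set has too many elements to enumerate. I am therefore looking for a way to compute a random subset directly. One possibility is to first draw "random lengths" and then use each one to draw a random subset of x: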

len = sample(1:length(x), size = n.samples, replace = TRUE)
len
# [1] 2 1 1

lapply(len, function(l) sort(sample(x, size = l)))
# [[1]]
# [1] 1 2
# 
# [[2]]
# [1] 1
# 
# [[3]]
# [1] 1

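However, this produces duplicates. Of course, I could now remove the duplicates and repeat the sampling in a while loop until I end up with n.samples distinct random subsets of the power set: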

drawSubsetOfPowerset = function(x, n) {
  k = length(x)
  ret = list()
  while(length(ret) < n) {
    # draw "random lengths" from 0:k, weighted by the number of subsets of each length, to reduce the number of loops
    len = sample(0:k, size = n, replace = TRUE, prob = choose(k, 0:k)/2^k)
    # draw a random subset of x for each "random length" and sort it to better identify duplicates
    random.subset = lapply(len, function(l) sort(sample(x, size = l)))
    # remove duplicates
    ret = unique(c(ret, random.subset))
  }
  # the last pass may overshoot, so keep only the first n subsets
  return(ret[seq_len(n)])
}

drawSubsetOfPowerset(x, n.samples)

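Of course, I could now try to optimize several components of the drawSubsetOfPowerset function, for example (1) avoiding a copy of the object ret in every iteration of the loop, (2) using a faster sort, (3) using a faster way to remove duplicates from the list, ...

My question is: is there a different (more efficient) way to do this?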

Answer 1

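How about using the binary representation? That way, we can draw a random sample of integers from a range whose size is the total number of elements in the power set, 2^length(v). From there, we can use intToBits together with indexing to guarantee that we generate random, unique subsets of the power set in an ordered fashion.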

randomSubsetOfPowSet <- function(v, n, mySeed) {
    set.seed(mySeed)
    lapply(sample(2^length(v), n) - 1, function(x) v[intToBits(x) > 0])
}

Taking x = 1:4, n.samples = 5, and a random seed of 42, we have:

randomSubsetOfPowSet(1:4, 5, 42)
[[1]]
[1] 2 3 4

[[2]]
[1] 1 2 3 4

[[3]]
[1] 3

[[4]]
[1] 2 4

[[5]]
[1] 1 2 3

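Explanation

What does the binary representation have to do with power sets?

It turns out that, given a set, we can find all of its subsets by turning to bits (yes, 0s and 1s). By viewing the elements in a subset as "on" elements of the original set and the elements not in the subset as "off", we get a very tangible way of thinking about how to generate each subset. Observe: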

       Original set: {a,  b,  c,  d}
                      |   |   |   |
                      V   V   V   V        b & d 
Existence in subset: 1/0 1/0 1/0 1/0       are on
                                            / \
                                           /   \
                                          |     |
                                          V     V
Example subset: {b, d} gets mapped to {0, 1, 0, 1}
                                       |  \   \  \_______
                                       |   |   \__       \
                                       |   |___   \____   \____
                                       |       |       |       |
                                       V       V       V       V
Thus, {b, d} is mapped to the integer  0*2^0 + 1*2^1 + 0*2^2 + 1*2^3 = 10
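
The mapping in the diagram can be sketched in R as follows. subsetToInt is a hypothetical helper name used only for illustration, not something from the answer itself.

```r
A <- c("a", "b", "c", "d")

# Hypothetical helper: the element at position p of the original set
# contributes 2^(p - 1) when it is present in the subset.
subsetToInt <- function(s, universe) {
  positions <- which(universe %in% s)
  sum(2^(positions - 1))
}

subsetToInt(c("b", "d"), A)   # 10
subsetToInt(character(0), A)  # 0  (empty subset)
subsetToInt(A, A)             # 15 (full set)
```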

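This is now a problem of combinations of bits of length n. If you apply this mapping to every subset of A = {a, b, c, d}, you obtain 0:15. Therefore, to obtain a random subset of the power set of A, we simply generate a random subset of 0:15 and map each integer to a subset of A. How might we do this?

sample comes to mind.

Now, it is also easy to go the other way (i.e., from an integer to a subset of the original set).

Observe: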

Given the integer 10 and set A given above (i.e. {a, b, c, d}) we have:

10 in bits is -->> {0, 1, 0, 1}

Which indices are greater than 0?

Answer: 2 and 4

Taking the 2nd and 4th elements gives {b, d}, et voilà!
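
As a small check of the reverse direction, the sketch below decodes the integer 10 back into {b, d} and then decodes a random sample of 0:15. intToSubset is a hypothetical helper name; it does the same job as the indexing in randomSubsetOfPowSet, but keeps only the first length(universe) bits explicitly.

```r
A <- c("a", "b", "c", "d")

# Hypothetical helper: keep the elements whose bit is set in i.
intToSubset <- function(i, universe) {
  bits <- as.integer(intToBits(i))[seq_along(universe)]
  universe[bits == 1L]
}

intToSubset(10L, A)  # "b" "d"

# Distinct integers decode to distinct subsets, so sampling without
# replacement from 0:15 yields unique elements of the power set.
set.seed(42)
ints <- sample(0:15, 5)
subsets <- lapply(ints, intToSubset, universe = A)
```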
