Question

I have a non-negative integer articles.length. I need to generate a list of n non-repeating pairs of numbers in the range 0...articles.length, e.g. (for articles.length == 10 and n == 5),

[[1, 3], [2, 6], [1, 6], [8, 1], [10, 3]]

How can I do this?

Answer 1

Code

mx = 10
n = 20
(0..(mx+1)**2-1).to_a.sample(n).map { |n| n.divmod(mx+1) }
  #=> [[6, 9], [3, 8], [7, 10], [3, 3], [2, 0], [8, 9], [4, 1], [9, 4], 
  #    [1, 0], [1, 8], [9, 6], [0, 10], [9, 0], [6, 8], [4, 9], [2, 10],
  #    [10, 0], [10, 5], [6, 10], [2, 9]]

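Explanation

Sampling pairs of numbers without replacement is the same as sampling single numbers without replacement when there is a one-to-one mapping between the two. Think of each single number as a two-digit number in base mx+1, so that each digit ranges from 0 to mx and corresponds to one element of a pair. There are (mx+1)**2 two-digit numbers in base mx+1, which in base 10 range from 0 to (mx+1)**2-1. We therefore need only sample n numbers from (0..(mx+1)**2-1).to_a and then use Integer#divmod to convert each sampled number (expressed in base 10) back to its two digits in base mx+1.

The procedure is clearly unbiased.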
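To make the mapping concrete, here is a small sketch (the choice mx = 2 and the variable names are only for illustration) that lists every code from 0 to (mx+1)**2-1 next to the pair that divmod produces, and checks that each possible pair is covered exactly once.

```ruby
mx = 2
base = mx + 1                       # number of values each element can take

codes = (0...base**2).to_a          # 0, 1, ..., (mx+1)**2 - 1
pairs = codes.map { |c| c.divmod(base) }

codes.zip(pairs).each { |c, pair| puts "#{c} -> #{pair.inspect}" }

# Every pair [i, j] with 0 <= i, j <= mx appears exactly once ...
all_pairs = (0..mx).to_a.product((0..mx).to_a)
puts pairs == all_pairs             # prints true

# ... and each pair maps back to its code, so the mapping is one-to-one.
puts pairs.map { |q, r| q * base + r } == codes   # prints true
```

Because the mapping is one-to-one, sampling codes without replacement and sampling pairs without replacement are the same experiment.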

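Alternative: generate pairs and discard duplicates

If (mx+1)**2 is sufficiently large relative to n, the fastest method may well be the following (which also produces an unbiased sample):

require 'set'

samp = Set.new
limit = mx+1
while samp.size < n
  samp << [rand(limit), rand(limit)]
end
samp.to_a
  #=> [[3, 6], [6, 2], [0, 3], [10, 0], [1, 8], [3, 4], [10, 3], [0, 4],
  #    [6, 7], [10, 7], [9, 1], [10, 5], [2, 7], [4, 8], [8, 4], [7, 3],
  #    [2, 4], [7, 10], [5, 3], [6, 3]]

I found that over 100 random draws of samples of 20 pairs (all for mx = 10), an average of 21.86 pairs had to be generated to obtain 20 unique pairs after duplicates were discarded.

Benchmark

I thought it might be interesting to benchmark the several methods that have been suggested.

It is convenient to put the methods to be tested in a module¹.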

require 'benchmark'
require 'set'

module Candidates 
  def samp_with_difmod(mx, n)
    (0..(mx+1)**2-1).to_a.sample(n).map { |n| n.divmod(mx+1) }
  end

  def discard_dups(mx, n)
    samp = Set.new
    limit = mx+1
    while samp.size < n
      samp << [rand(limit), rand(limit)]
    end
    samp.to_a
  end

  def sawa_repeated_perm(mx, n)
    (0..mx).to_a.repeated_permutation(2).to_a.sample(n)
  end

  def sawa_product(mx, n)
    (0..mx).to_a.product((0..mx).to_a).sample(n)
  end
end

include Candidates
@candidates = Candidates.public_instance_methods(false)
  #=> [:samp_with_difmod, :discard_dups, :sawa_repeated_perm, :sawa_product]
@indent = @candidates.map { |m| m.to_s.size }.max
  #=> 18

def bench(mx, n, reps)
  puts "\n0-#{mx}, sample size = #{n}, #{reps} reps"
  Benchmark.bm(@indent) do |bm|
    @candidates.each do |m|
      bm.report m.to_s do
        reps.times { send(m, mx, n) }
      end
    end
  end
end

bench(10, 20, 100)
0-10, sample size = 20, 100 reps
                         user     system      total        real
samp_with_difmod     0.000000   0.000000   0.000000 (  0.002536)
discard_dups         0.000000   0.000000   0.000000 (  0.005312)
sawa_repeated_perm   0.000000   0.000000   0.000000 (  0.004901)
sawa_product         0.000000   0.000000   0.000000 (  0.004742)

bench(100, 20, 100)
0-100, sample size = 20, 100 reps
                         user     system      total        real
samp_with_difmod     0.031250   0.015625   0.046875 (  0.088003)
discard_dups         0.000000   0.000000   0.000000 (  0.005618)
sawa_repeated_perm   0.093750   0.000000   0.093750 (  0.136010)
sawa_product         0.125000   0.000000   0.125000 (  0.138848)

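bench(10, 121, 100)
0-10, sample size = 121, 100 reps
                         user     system      total        real
samp_with_difmod     0.000000   0.000000   0.000000 (  0.003283)
discard_dups         0.171875   0.015625   0.187500 (  0.208459)
sawa_repeated_perm   0.000000   0.000000   0.000000 (  0.004253)
sawa_product         0.000000   0.000000   0.000000 (  0.002947)

In the benchmark above, sampling is from a population of 11**2 #=> 121 pairs. Drawing a sample of 121 without replacement from a population of 121 means the sample consists of every pair in the population. It is therefore not surprising that discard_dups performs relatively poorly. For example, after 120 unique pairs have been drawn, it keeps rejecting duplicates until it stumbles on the one remaining pair not yet in the sample.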
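The slowdown can be estimated. Assuming each draw is uniform over the N = (mx+1)**2 possible pairs, the expected number of draws needed to go from k to k+1 unique pairs is N/(N-k), so the expected total is the sum of N/(N-k) for k from 0 to n-1. A rough sketch of that calculation (expected_draws is a hypothetical helper, not part of the benchmark code):

```ruby
# Expected number of random draws needed to collect n distinct pairs
# from a population of `population` equally likely pairs.
def expected_draws(population, n)
  (0...n).sum { |k| population.fdiv(population - k) }
end

puts expected_draws(121, 20).round(2)    # sample of 20 from 121 pairs
puts expected_draws(121, 121).round(2)   # sample of all 121 pairs
```

For a sample of 20 this gives roughly 22 draws, in line with the observed average of 21.86 reported earlier in this answer, while collecting all 121 pairs takes roughly 650 draws on average.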

bench(100, 100, 100)
0-100, sample size = 100, 100 reps
                         user     system      total        real
samp_with_difmod     0.046875   0.000000   0.046875 (  0.042177)
discard_dups         0.031250   0.000000   0.031250 (  0.029467)
sawa_repeated_perm   0.109375   0.000000   0.109375 (  0.132429)
sawa_product         0.125000   0.000000   0.125000 (  0.140451)

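bench(1000, 500, 10)
0-1000, sample size = 500, 10 reps
                         user     system      total        real
samp_with_difmod     0.437500   0.140625   0.578125 (  0.632434)
discard_dups         0.015625   0.000000   0.015625 (  0.013634)
sawa_repeated_perm   1.718750   0.359375   2.078125 (  2.166724)
sawa_product         1.734375   0.062500   1.796875 (  1.853555)

In the last benchmark, where the maximum value (1000) is large relative to the sample size (500), discard_dups is the clear winner. Here the size of the sample space is 1001**2 #=> 1_002_001, so discard_dups encounters relatively few duplicates when drawing a sample of size 500.

sawa_product performed better than sawa_repeated_perm in this test, but the two methods performed similarly in the other tests.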

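¹ Including a module that contains the methods to be tested simplifies the code and makes it easy to add, remove and rename the methods to be tested.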

Answer 2

Not the most efficient approach, but the following works.

(0...10).to_a.repeated_permutation(2).to_a.sample(5)
#=> [[8, 4], [2, 9], [5, 0], [5, 4], [4, 3]]

Answer 3

If you need to do this regularly, we can create an enumerator that does it for us (thanks to @CarySwoveland for the conceptual math and the use of Set):

require 'set'
def generator(limit,size=2) 
  enum_size = (limit.is_a?(Range) ? limit.size : limit += 1) ** size 
  if enum_size.infinite?
    limit = limit.is_a?(Range) ? (limit.first..Float::MAX.to_i) : Float::MAX.to_i
  end       
  Enumerator.new(enum_size) do |y| 
    s = Set.new
    loop do 
      new_rand = Array.new(size) { rand(limit) }
      y << new_rand if s.add?(new_rand)
      raise StopIteration if s.size == enum_size
    end
  end
end

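Getting n pairs now does not require us to generate all possible permutations before sampling. Instead, we generate n random pairs on demand (never more than the maximum number of possible permutations).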
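The duplicate check in the generator rests on Set#add?, which returns nil when the element is already in the set. A minimal sketch of that idea on its own (the fixed list of draws is made up for illustration and stands in for the rand calls):

```ruby
require 'set'

seen  = Set.new
draws = [[1, 2], [3, 4], [1, 2], [0, 0], [3, 4]]   # pretend these came from rand

# Keep a draw only the first time it is seen.
unique = draws.select { |pair| seen.add?(pair) }

p unique   # prints [[1, 2], [3, 4], [0, 0]]
```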

Usage:

g = generator(10)
g.take(5)
#=> [[10, 4], [9, 6], [9, 9], [2, 6], [4, 6]]
g.take(5)
#=> [[9, 7], [2, 8], [2, 2], [8, 8], [7, 3]]
g.to_a.sort
#=> [[0, 0], [0, 1], [0, 2], [0, 3], [0, 4], [0, 5], [0, 6], [0, 7], [0, 8], [0, 9], [0, 10], 
#    [1, 0], [1, 1], [1, 2], [1, 3], [1, 4], [1, 5], [1, 6], [1, 7], [1, 8], [1, 9], [1, 10], 
# ..., [10,10]]

Passing a range also works: generator((2..7)) will only generate combinations between [2,2] and [7,7].

Additionally, I added the ability to use any number of elements per subset without sacrificing generation speed, e.g.

g = generator((0..Float::INFINITY),20)
g.size 
#=>  Infinity 
g.first
#=> [20 extremely large numbers]
require 'benchmark'
Benchmark.bmbm do |x| 
  x.report(:fast) { g.first(10) } 
  x.report(:large_fast) { g.first(10_000) } 
end

# Rehearsal ----------------------------------------------
# fast         0.000552   0.000076   0.000628 (  0.000623)
# large_fast   0.612065   0.035515   0.647580 (  0.672739)
# ------------------------------------- total: 0.648208sec
#                  user     system      total        real
# fast         0.000728   0.000000   0.000728 (  0.000744)
# large_fast   0.598493   0.000000   0.598493 (  0.607784)

Answer 4

Only if s <= n and pairs containing equal elements are acceptable:

def random_pairs n, s
  res = []
  (1..n).to_a.tap { |a| res = a.sample(s).zip(a.sample(s)) }
  res
end

random_pairs 10, 5
#=> [[6, 5], [1, 1], [9, 6], [8, 4], [4, 7]]

Pairs do not repeat, but a pair may contain two equal elements, such as [1, 1].
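The reason no pair repeats is that a.sample(s) draws without replacement, so the first elements are all distinct. That also means two pairs can never share a first (or a second) element, so the result is not a uniform sample over all possible sets of pairs. A quick check, restating the method in a shorter but equivalent form:

```ruby
def random_pairs(n, s)
  a = (1..n).to_a
  a.sample(s).zip(a.sample(s))
end

pairs   = random_pairs(10, 5)
firsts  = pairs.map(&:first)
seconds = pairs.map(&:last)

p pairs
puts firsts.uniq.size == firsts.size     # prints true: first elements never repeat
puts seconds.uniq.size == seconds.size   # prints true: second elements never repeat
```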

Following the comments, here is a version that addresses them, but it is much slower:

def random_pairs_bis n, s
  res = []
  s.times {(1..n).to_a.tap { |a| res << a.shuffle.zip(a.shuffle) }}
  res.flatten(1).uniq.sample(s)
end

Answer 5

This may be enough to answer the OP:

articles = ["apples","pencils","chairs","pencils","guitars","parsley", "ink","stuff" ]
n = 5
p random_pairs = Array.new(n){articles.sample(2) }

# => [["parsley", "apples"], ["pencils", "chairs"], ["pencils", "apples"], ["stuff", "apples"], ["stuff", "guitars"]]
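Note that nothing in this approach prevents the same pair from being drawn twice, and it returns articles rather than index numbers. If non-repeating index pairs are required, one possible variation (a swapped-in technique, not the answerer's code, and assuming that order within a pair does not matter) is to sample from all two-element index combinations:

```ruby
articles = ["apples", "pencils", "chairs", "pencils", "guitars", "parsley", "ink", "stuff"]
n = 5

# All unordered pairs of distinct indices, then n of them without replacement.
index_pairs = (0...articles.length).to_a.combination(2).to_a.sample(n)

p index_pairs
p index_pairs.map { |i, j| [articles[i], articles[j]] }
```

Like the repeated_permutation approach, this builds every candidate pair first, so it is only practical for short lists.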

Generating random pairs of numbers within a range
