Question

I have a hash of IDs and their weights.

y = { 1 => 0.7, 2 => 0.2, 3 => 0.1 }

I want to shuffle this hash according to the weights.

I have tried many different approaches, and all of them give me similar, unexpected results. This is the most concise one I found:

y.sort_by {|v| -v[1]*rand()}

When I run this ten thousand times and pick out the first ID each time, I get the following counts:

{1=>8444, 2=>1316, 3=>240}

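I expected these counts to reflect the weights above (e.g. 1 => 7000). I am a bit fuzzy on why this shuffle does not match those weights. Can someone clear up my confusion and tell me how to fix it?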

Here are some useful sources I found:

Answer 1

Here is another way to do weighted random sampling, using Enumerable#max_by and a remarkable result from Efraimidis and Spirakis:

Given a hash whose values represent probabilities that sum to 1, we can get a weighted random sample like this:

# hash of ids with their respective weights that sum to 1
y = { 1 => 0.7, 2 => 0.2, 3 => 0.1 }

# lambda that randomly returns a key from y in proportion to its weight
wrs = -> { y.max_by { |_, weight| rand ** (1.0/weight) }.first }

# test run to see if it works
10_000.times.each_with_object(Hash.new(0)) { |_, freq| freq[wrs.call] += 1 }

# => {1=>6963, 3=>979, 2=>2058}
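The lambda above draws a single key, whereas the question asks for a full shuffle. As a minimal sketch, assuming the same Efraimidis and Spirakis key (rand ** (1.0/weight)) is computed once per entry, sorting all entries by that key in descending order should give a weighted ordering. The method name weighted_shuffle and its random: parameter are hypothetical, introduced here only for illustration.

```ruby
# hash of ids with their respective weights
y = { 1 => 0.7, 2 => 0.2, 3 => 0.1 }

# Hypothetical helper: orders the keys by the random key u ** (1.0 / weight),
# largest first. `random` is anything that responds to #rand (the Random
# class itself, or a seeded Random instance for reproducible runs).
def weighted_shuffle(weights, random: Random)
  weights.sort_by do |_, weight|
    u = random.rand            # uniform float in [0, 1)
    -(u ** (1.0 / weight))     # negate so the largest key sorts first
  end.map(&:first)
end

# tally which id comes first over 10,000 shuffles
first_counts = Hash.new(0)
10_000.times { first_counts[weighted_shuffle(y).first] += 1 }
p first_counts
```

With these weights, the count for id 1 should land near 7,000, since the first element of the ordering is the same element that max_by would have picked.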

On another note, there has been talk of adding weighted random sampling to Array#sample, but the feature seems to have been lost in the shuffle.

Further reading:

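Ruby-Doc for Enumerable#max_by, specifically the wsample example
Weighted Random Sampling by Efraimidis and Spirakis (2005), which introduces the algorithm
New features for Array#sample, Array#choice, which mentions adding weighted random sampling to Array#sample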

Answer 2

Here is a very inefficient but hopefully working solution (though I make no promises about correctness, and the code won't make many Rubyists happy...).

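The essence of the algorithm is simple: pick an element at random according to its weight, remove it, then repeat with the remaining elements.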

def shuffle some_hash
   result = []

   numbers = some_hash.keys
   weights = some_hash.values
   total_weight = weights.reduce(:+)

   # choose numbers one by one
   until numbers.empty?
      # weight from total range of weights
      selection = rand() * total_weight

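      # find which element this corresponds with; the bounds check guards
      # against floating-point drift and against rand returning exactly 0.0
      i = 0
      while i < weights.size - 1 && selection >= weights[i]
         selection -= weights[i]
         i += 1
      end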

      # add number to result and remove corresponding weight
      result << numbers[i]
      numbers.delete_at i
      total_weight -= weights.delete_at(i)
   end

   result
end
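For comparison, here is a more compact sketch of the same draw, remove, repeat idea, with a tally to check the result against the weights. The name weighted_shuffle_by_removal and the random: parameter are hypothetical, and this is an illustration under those assumptions rather than a tuned implementation.

```ruby
# Hypothetical helper: repeatedly draw one id in proportion to its weight,
# remove it from the pool, and continue until the pool is empty.
def weighted_shuffle_by_removal(weights, random: Random)
  pool = weights.to_a                # [[id, weight], ...]; the hash is untouched
  result = []
  until pool.empty?
    target = random.rand * pool.sum { |_, w| w }
    # index of the first entry whose cumulative weight exceeds target;
    # fall back to the last entry if rounding pushes target past the total
    acc = 0.0
    idx = pool.index { |_, w| (acc += w) > target } || pool.size - 1
    result << pool.delete_at(idx).first
  end
  result
end

y = { 1 => 0.7, 2 => 0.2, 3 => 0.1 }

# tally which id comes first over 10,000 shuffles
removal_counts = Hash.new(0)
10_000.times { removal_counts[weighted_shuffle_by_removal(y).first] += 1 }
p removal_counts
```

The first id drawn is a plain weighted sample, so the count for id 1 should be close to 7,000.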

Answer 3

You are given the probability density function (P for "probability"):

P(1) = 0.7
P(2) = 0.2
P(3) = 0.1

You need to construct the (cumulative) distribution function, which looks like this:

[Figure: distribution function]

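We can now generate random numbers between 0 and 1, plot them on the Y axis, draw a line to the right to see where it intersects the distribution function, and read off the associated X coordinate as the random variate. So if the random number is less than 0.7, the random variate is 1; if it is between 0.7 and 0.9, the random variate is 2; and if it exceeds 0.9, the random variate is 3. (Note that the probability that rand exactly equals 0.7 (say) is virtually zero, so we need not fuss about distinguishing between < 0.7 and <= 0.7.)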
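Written out for the weights in this question, the distribution function and the lookup rule can be sketched as follows, where F, X and u are notation introduced here for illustration (u stands for the value returned by rand):

```latex
F(x) = P(X \le x) =
\begin{cases}
0   & x < 1 \\
0.7 & 1 \le x < 2 \\
0.9 & 2 \le x < 3 \\
1   & x \ge 3
\end{cases}
\qquad
X = \min\{\, x \in \{1, 2, 3\} : u < F(x) \,\}, \quad u \sim \mathrm{Uniform}[0, 1)
```

For example, u = 0.85 gives X = 2, because 0.85 is not below F(1) = 0.7 but is below F(2) = 0.9.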

To implement this, first compute the hash df:

y = { 1 => 0.7, 2 => 0.2, 3 => 0.1 }

last = 0.0
df = y.each_with_object({}) { |(v,p),h| last += p; h[last.round(10)] = v }
  #=> {0.7=>1, 0.9=>2, 1.0=>3}

Now we can create a random variate as follows:

def rv(df)
  rn = rand
  df.find { |p,_| rn < p }.last
end

Let's try it:

def count(df,n)
  n.times.each_with_object(Hash.new(0)) { |_,count|
    count[rv(df)] += 1 }
end

n = 10_000
count(df,n)
  #=> {1=>6993, 2=>1960, 3=>1047} 
count(df,n)
  #=> {1=>6986, 2=>2042, 3=>972} 
count(df,n)
  #=> {1=>6970, 2=>2039, 3=>991}

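Note that the order of the key-value pairs in the count hash is determined by the outcomes of the first few random variates, so the keys will not necessarily be in the order shown here.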

Answer 4

If you make the weights integer values, like this:

y = { 1 => 7, 2 => 2, 3 => 1 }

then you can construct an array in which the number of occurrences of each item is based on its weight:

weighted_occurrences = y.flat_map { |id, weight| Array.new(weight, id) }
# => [1, 1, 1, 1, 1, 1, 1, 2, 2, 3]

Then doing a weighted shuffle is as simple as:

weighted_occurrences.shuffle.uniq

After 10,000 such shuffles, picking out the first ID each time, I get:

{
  1 => 6988,
  2 => 1934,
  3 => 1078
}
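The tally itself is not shown in this answer; a minimal sketch of how it could be reproduced follows. The variable name first_id_counts is an assumption made for illustration, and exact counts will differ from run to run.

```ruby
# integer weights, as above
y = { 1 => 7, 2 => 2, 3 => 1 }

# each id repeated as many times as its weight
weighted_occurrences = y.flat_map { |id, weight| Array.new(weight, id) }

# shuffle 10,000 times and count which id ends up first after uniq
first_id_counts = Hash.new(0)
10_000.times { first_id_counts[weighted_occurrences.shuffle.uniq.first] += 1 }
p first_id_counts
```

Because 7 of the 10 slots hold id 1, it should come first in roughly 70% of the shuffles. Note that the array grows with the sum of the weights, so this approach suits small integer weights best.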

Weighted shuffle of an array
