Scala: How do I generate numbers according to an expected distribution?

Asked: 2014-07-21 15:51:15

Tags: scala random distribution

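Scala's Random is not much help when you want to generate random numbers in a small range, or when you already know that certain numbers have a probability attached to them. Obviously, scala.util.Random.nextInt won't get the whole job done.

How do I select numbers according to their weights?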

2 answers:

Answer 0 (score: 18)

Here's a simple explanation. Imagine you have a multinomial over the values A, B, C, with the following probabilities:

  • A = 0.5
  • B = 0.3
  • C = 0.2

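If you want to sample a value according to its probability, that means roughly 50% of the time you want to get A, 30% of the time B, and 20% of the time C.

Picture your distribution as a broken stick: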

       A         B      C
      0.5       0.3    0.2
|------------|-------|-----|
0           0.5     0.8   1.0

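Sampling from the multinomial starts with a random value p drawn uniformly between 0 and 1. You then check which part of the stick p falls into and return the corresponding value.

So if p = 0.7, then: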

       A         B      C
      0.5       0.3    0.2
|------------|-------|-----|
0           0.5     0.8   1.0
                  ^
                 p=0.7

you would return B.

Code-wise, it would look like this:

final def sample[A](dist: Map[A, Double]): A = {
  val p = scala.util.Random.nextDouble
  val it = dist.iterator
  var accum = 0.0
  while (it.hasNext) {
    val (item, itemProb) = it.next
    accum += itemProb
    if (accum >= p)
      return item  // return so that we don't have to search through the whole distribution
  }
  sys.error("this should never happen")  // reached only if the weights sum to less than p (e.g. through rounding); also needed so it will compile
}

You can check it like this:

val dist = Map('A -> 0.5, 'B -> 0.3, 'C -> 0.2)
sample(dist)  // e.g. 'A

Vector.fill(1000)(sample(dist)).groupBy(identity).mapValues(_.size) // e.g. Map('A -> 510, 'B -> 300, 'C -> 190)
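To check the stick picture without any randomness, here is a minimal sketch that takes p as an argument instead of drawing it. StickLookup and pick are hypothetical names, not part of the code above, and the distribution is passed as an ordered Seq (an assumption) so that the segments sit in the same order as in the diagram.

```scala
object StickLookup {
  // Hypothetical helper: returns the item whose stick segment contains p.
  // dist is an ordered sequence so that the segment order is explicit.
  def pick[A](dist: Seq[(A, Double)], p: Double): A = {
    require(dist.nonEmpty, "distribution must not be empty")
    // Right-hand edge of each segment, e.g. 0.5, 0.8, 1.0 for the A/B/C example.
    val edges = dist.map(_._2).scanLeft(0.0)(_ + _).tail
    // First segment whose right-hand edge reaches p.
    val idx = edges.indexWhere(_ >= p)
    // If rounding leaves p past the last edge, fall back to the last item.
    if (idx >= 0) dist(idx)._1 else dist.last._1
  }
}
```

With the segments A = 0.5, B = 0.3, C = 0.2 in that order, StickLookup.pick(dist, 0.7) lands in the second segment and returns B, as in the diagram.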

Other notes:

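  • If dist is not a probability distribution (i.e. the weights don't sum to one), you would simply use p = nextDouble * dist.values.sum. So if dist sums to 0.5, p is uniform between 0.0 and 0.5; if it sums to 20, p is uniform between 0.0 and 20.0.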
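The scaling trick for weights that don't sum to one might look like the sketch below. WeightedSample and sampleWeighted are hypothetical names, and passing the Random instance in explicitly is an assumption made so that the draw can be seeded.

```scala
import scala.util.Random

object WeightedSample {
  // Hypothetical helper: weights must be non-negative with a positive total,
  // but they do not have to sum to 1.
  def sampleWeighted[A](weights: Seq[(A, Double)], rng: Random): A = {
    // Running totals; the last one is the sum of all weights.
    val cumulative = weights.map(_._2).scanLeft(0.0)(_ + _).tail
    require(cumulative.nonEmpty && cumulative.last > 0.0, "total weight must be positive")
    // Scale the uniform draw to [0, total) instead of normalizing the weights.
    val p = rng.nextDouble() * cumulative.last
    // Strict comparison, so an entry with weight 0 is never selected here.
    val idx = cumulative.indexWhere(_ > p)
    // Rounding fallback if p ends up equal to the total.
    if (idx >= 0) weights(idx)._1 else weights.last._1
  }
}
```

For weights 10, 0 and 30, the third entry should come up about three times as often as the first, and the zero-weight entry should not come up at all.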

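There are other optimizations you can make, such as sorting the entries so the largest probabilities come first, which minimizes the number of entries you have to look through before you have accumulated p probability mass. But this should get the basic idea across.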
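As a rough illustration of the sorting idea, the sketch below sorts once, largest weight first, and reuses that order for every draw. SortedSampler is a hypothetical name, and scaling p by the total weight is an added assumption so that unnormalized weights also work.

```scala
import scala.util.Random

// Hypothetical class: does the sorting once, up front, rather than on every draw.
final class SortedSampler[A](dist: Map[A, Double]) {
  require(dist.nonEmpty, "distribution must not be empty")

  // Largest weights first, so the scan below usually stops after a few entries.
  private val entries: Vector[(A, Double)] = dist.toVector.sortWith(_._2 > _._2)
  private val total: Double = entries.map(_._2).sum

  def sample(rng: Random): A = {
    val p = rng.nextDouble() * total
    var accum = 0.0
    var i = 0
    // Scan all entries except the last; the last one is the fallback.
    while (i < entries.length - 1) {
      accum += entries(i)._2
      if (accum > p) return entries(i)._1
      i += 1
    }
    entries.last._1
  }
}
```

Sorting does not change the probabilities, only how many entries the scan visits on average. For large distributions, a binary search over precomputed cumulative sums would be another option.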

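Answer 1 (score: 0)

I was able to adapt the script in this Gist, https://gist.github.com/anonymous/2033568 , to implement a single weighted random selection from a list of weighted numbers. Here is an example of how to use it: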

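// Needs WeightedRandomSelection.WeightedItem in scope; min, max, datasetId, i
// and Probability.pNumber are not defined in this post.
val range = (min to max)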

// use probabilities from the existing dataset to generate numbers
val weightedItems = range.map { number =>
  val w = Probability.pNumber(datasetId, number, i)
  WeightedItem[Int](number, w)
}
val selection = WeightedRandomSelection.singleWeightedSelection(weightedItems)

Here is the Gist, slightly adapted:

import scala.util.Random

object WeightedRandomSelection {

  /**
   * Get the number of times an event with probability p occurs in N samples.
   * if R is res, then P(R=n) = p^n q^(N-n) N! / n! / (N-n)!
   * where q = 1-p
   * This has the property that P(R=0) = q^N, and
   * P(R=n+1) = p/q (N-n)/(n+1) P(R=n)
   * Also note that P(R=n+1|R>n) = P(R=n+1)/P(R>n)
   * Uses these facts to work out the probability that the result is zero. If
   * not, then the prob that given that, the result is 1, etc.
   */
  def numEntries(p:Double,N:Int,r:Random) : Int = if (p>0.5) N-numEntries(1.0-p,N,r) else if (p<0.0) 0 else {
    var n = 0
    val q = 1.0-p
    var prstop = Math.pow(q,N)
    var cumulative = 0.0
    while (n<N && (r.nextDouble()*(1-cumulative))>=prstop) {
      cumulative+=prstop
      prstop*=p*(N-n)/(q*(n+1))
      n+=1
    }
    n
  }


  case class WeightedItem[T](item: T, weight: Double)

  /**
   * Compute a weighted selection from the given items.
   * cumulativeSum must be the same length as items (or longer), with the ith element containing the sum of all
   * weights from the item i to the end of the list. This is done in a saved way rather than adding up and then
   * subtracting in order to prevent rounding errors from causing a variety of subtle problems.
   */
  private def weightedSelectionWithCumSum[T](items: Seq[WeightedItem[T]],cumulativeSum:List[Double], numSelections:Int, r: Random) : Seq[T] = {
    if (numSelections==0) Nil
    else {
      val head = items.head
      val nhead = numEntries(head.weight/cumulativeSum.head,numSelections,r)
      List.fill(nhead)(head.item)++weightedSelectionWithCumSum(items.tail,cumulativeSum.tail,numSelections-nhead,r)
    }
  }


  def weightedSelection[T](items: Seq[WeightedItem[T]], numSelections:Int, r: Random): Seq[T] = {
    val cumsum = items.foldRight(List(0.0)){(wi,l)=>(wi.weight+l.head)::l}
    weightedSelectionWithCumSum(items,cumsum,numSelections,r)
  }

  def singleWeightedSelection[T](items: Seq[WeightedItem[T]]): T = {
    val r = new scala.util.Random()
    val numSelections = 1
    val cumsum = items.foldRight(List(0.0)){(wi,l)=>(wi.weight+l.head)::l}
    weightedSelectionWithCumSum(items,cumsum,numSelections,r).head
  }


  def testRandomness[T](items: Seq[WeightedItem[T]], numSelections:Int, r: Random): Unit = {
    val runs = 10000
    val indexOfItem = Map.empty++items.zipWithIndex.map{case (item,ind)=>item.item->ind}
    val numItems = items.length
    val bucketSums = new Array[Double](numItems)
    val bucketSumSqs = new Array[Double](numItems)
    for (run<-0 until runs) {
      // compute chi-squared for a run
      val runresult = weightedSelection(items,numSelections,r)
      val buckets = new Array[Double](numItems)
      for (r<-runresult) buckets(indexOfItem(r))+=1
      for (i<-0 until numItems) {
        val count = buckets(i)
        bucketSums(i)+=count
        bucketSumSqs(i)+=count*count
      }
    }
    val sumWeights = items.foldLeft(0.0)(_+_.weight)
    for ((item,ind)<-items.zipWithIndex) {
      val p = item.weight/sumWeights
      val mean = bucketSums(ind)/runs
      val variance = bucketSumSqs(ind)/runs-mean*mean
      val expectedMean = numSelections*p
      val expectedVariance = numSelections*p*(1-p)
      val expectedErrorInMean = Math.sqrt(expectedVariance/runs)
      val text = "Item %10s Mean %.3f Expected %.3f±%.3f Variance %.3f expected %.3f".format(item.item,mean,expectedMean,expectedErrorInMean,variance,expectedVariance)
      println(text)
    }
  }

}
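The recurrence quoted in the numEntries comment can be checked numerically. The sketch below is not part of the Gist. It builds the whole binomial distribution from just the two facts in that comment, P(R=0) = q^N and P(R=n+1) = p/q (N-n)/(n+1) P(R=n). BinomialRecurrence and pmf are hypothetical names.

```scala
object BinomialRecurrence {
  // Returns Vector(P(R=0), ..., P(R=n)) for n trials with success probability p,
  // using only the starting value and the step rule from the numEntries comment.
  def pmf(p: Double, n: Int): Vector[Double] = {
    require(p > 0.0 && p < 1.0 && n >= 0, "need 0 < p < 1 and n >= 0")
    val q = 1.0 - p
    Iterator
      .iterate((0, math.pow(q, n))) { case (k, prob) =>
        // Same update as prstop *= p*(N-n)/(q*(n+1)) in numEntries.
        (k + 1, prob * p * (n - k) / (q * (k + 1)))
      }
      .take(n + 1)
      .map(_._2)
      .toVector
  }
}
```

The probabilities returned by BinomialRecurrence.pmf(0.3, 10) should sum to 1 up to rounding and have mean N*p = 3, which matches the expectedMean that testRandomness computes.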