Question

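Imagine we have four symbols: 'a', 'b', 'c' and 'd'. We also have four given probabilities of those symbols appearing in the function output: P1, P2, P3, P4 (which sum to 1). How would one implement a function that generates a random sample of three of these symbols, such that the resulting symbols appear in the output with the specified probabilities?

Example: 'a', 'b', 'c' and 'd' have probabilities 9/30, 8/30, 7/30 and 6/30 respectively. The function outputs various random samples of any three of the four symbols: 'abc', 'dca', 'bad' and so on. We run the function many times and count how often each symbol occurs in its output. In the end, the count stored for 'a' divided by the total number of symbols output should converge to 9/30, for 'b' to 8/30, for 'c' to 7/30, and for 'd' to 6/30.

E.g. the function generates 10 outputs: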

adc dab bca dab dba cab dcb acd cab abc

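which, out of 30 symbols, contain 9 of 'a', 8 of 'b', 7 of 'c' and 6 of 'd'. This is an idealized example, of course, since the values would only converge when the number of samples is much larger, but it should hopefully convey the idea.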
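As a quick check of the idealized example, here is a minimal Python sketch that counts each symbol's share of all emitted symbols. The helper name `symbol_frequencies` is made up for illustration; the ten outputs are the ones listed above.

```python
from collections import Counter

# The ten idealized outputs from the example.
outputs = "adc dab bca dab dba cab dcb acd cab abc".split()

def symbol_frequencies(samples):
    """Return each symbol's share of all emitted symbols."""
    counts = Counter("".join(samples))
    total = sum(counts.values())
    return {sym: counts[sym] / total for sym in sorted(counts)}

freqs = symbol_frequencies(outputs)
for sym, share in freqs.items():
    # Print the share as a numerator over 30 for easy comparison.
    print(sym, round(share * 30), "/ 30")
```

The same helper can be pointed at the output of any candidate sampler to see how close its long-run frequencies come to the targets.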

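Obviously, this is only possible when none of the probabilities is larger than 1/3, since every single sample output always contains three distinct symbols. It is OK for the function to enter an infinite loop or otherwise behave erratically if the provided values cannot be satisfied.

Note: the function should obviously use an RNG, but should otherwise be stateless. Each new invocation should be independent of any previous one, except for the RNG state.

EDIT: Even though the description mentions choosing 3 out of 4 values, ideally the algorithm should be able to cope with any sample size.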
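The 1/3 bound generalizes: with samples of k distinct symbols, no symbol can account for more than 1/k of the output. A small sketch of that necessary-condition check follows; the function name `is_feasible` and the tolerance are assumptions for illustration, not part of the question.

```python
def is_feasible(probs, sample_size, tol=1e-9):
    """Necessary conditions: probabilities sum to 1 and none exceeds 1/sample_size."""
    return (abs(sum(probs) - 1.0) <= tol
            and all(0.0 <= p <= 1.0 / sample_size + tol for p in probs))

print(is_feasible([9/30, 8/30, 7/30, 6/30], 3))  # True
print(is_feasible([0.5, 0.2, 0.2, 0.1], 3))      # False: 0.5 > 1/3
```

Calling this first would let a sampler reject impossible inputs instead of looping forever.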

Answer 1

Your problem is underdetermined.

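If we assign a probability to each three-letter string that we allow, p(abc), p(abd), p(acd) etc., we can generate a series of equations

eqn1: p(abc) + p(abd) + ... others with an "a" ... = 3*p1
  ...
  ...
eqn4: p(abd) + p(acd) + ... others with a "d" ... = 3*p4

The factor 3 is there because each string holds three symbols, so a symbol that makes up a share p1 of all output symbols has to occur in a fraction 3*p1 of the strings; the code below sets up its constraints this way. This has more unknowns than equations, so there are many ways to solve it. Once you have found a solution, by whatever method you choose (use the simplex algorithm if you're me), sample from the probabilities of each string using the roulette method described by @alestanis.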

import numpy as np
import matplotlib.pyplot as plt

# originally written against cvxopt-1.1.5
from cvxopt import matrix, solvers

###########################
# Functions to do some parts

# function to find all valid outputs
def perms(alphabet, length):
    if length == 0:
        yield ""
        return
    for i in range(len(alphabet)):
        val1 = alphabet[i]
        for val2 in perms(alphabet[:i]+alphabet[i+1:], length-1):
            yield val1 + val2


# roulette sampler
def roulette_sampler(values, probs):
    # Create cumulative prob distro (depends only on the arguments, not on globals)
    probs_cum = np.cumsum(probs)
    def fun():
        r = np.random.rand()
        for p,s in zip(probs_cum, values):
            if r < p:
                return s
        # in case of rounding error
        return values[-1]
    return fun


############################
#    Main Part

# create list of all valid strings

alphabet = "abcd"
string_length = 3
# probability that a string contains each symbol: string_length * symbol share
alpha_probs = [string_length*x/30. for x in range(9,5,-1)]

# show probs
for a,p in zip(alphabet, alpha_probs):
    print("p("+a+") =", p)

# all valid outputs for this particular case
strings = list(perms(alphabet, string_length))
n_strings = len(strings)

# constraints from probabilities p(abc) + p(abd) ... = p(a)
contains = np.array([[1. if a in s else 0. for a in alphabet] for s in strings])
#both = concatenate((contains,wons), axis=1).T # hacky, but whatever
#A = matrix(both)
#b = matrix(alpha_probs + [1.])
A = matrix(contains.T)
b = matrix(alpha_probs)

# also need to constrain each p, and their sum, to [0,1]
wons = np.array([[1. for s in strings]])
G = matrix(np.concatenate((np.eye(n_strings),wons,-np.eye(n_strings),-wons)))
h = matrix(np.concatenate((np.ones(n_strings+1),np.zeros(n_strings+1))))

## target matrices for approx KL divergence
# uniform prob over valid outputs
u = 1./len(strings)
P = matrix(np.eye(n_strings))
q = -0.5*u*matrix(np.ones(n_strings))
# qp minimises 0.5*p'Pp + q'p, i.e. 0.5*(p^2 - u*p) for each p val equally


# Do convex optimisation
sol = solvers.qp(P,q,G,h,A,b)
# flatten the (n,1) solution into a plain 1-D vector of string probabilities
probs = np.array(sol['x']).flatten()

# Print output
for s,p in zip(strings,probs):
    print("p("+s+") =", p)
checkprobs = [0. for char in alphabet]
for i,a in enumerate(alphabet):
    for s,p in zip(strings,probs):
        if a in s:
            checkprobs[i] += p
    print("p("+a+") =", checkprobs[i])
print("total =", probs.sum())

# Create the sampling function
rndstring = roulette_sampler(strings, probs)


###################
# Verify

print("sampling...")
test_n = 1000
output = [rndstring() for i in range(test_n)]

# count how many sampled strings contain each character
sampled_freqs = []
for char in alphabet:
    n = 0
    for val in output:
        if char in val:
            n += 1
    sampled_freqs.append(n)

print("plotting histogram...")
plt.bar(range(len(alphabet)), np.array(sampled_freqs)/float(test_n), width=0.5)
plt.show()

EDIT: Python code

Answer 2

Assuming that the length of a word is always one less than the number of symbols, the following C# code will do the job:

using System;
using System.Collections.Generic;
using System.Linq;
using MathNet.Numerics.Distributions;

namespace RandomSymbols
{
    class Program
    {
        static void Main(string[] args)
        {
            // Sample case:  Four symbols with the following distribution, and 10000 trials
            double[] distribution = { 9.0/30, 8.0/30, 7.0/30, 6.0/30 };
            int trials = 10000;

            // Create an array containing all of the symbols
            char[] symbols = Enumerable.Range('a', distribution.Length).Select(s => (char)s).ToArray();

            // We're assuming that the word length is always one less than the number of symbols
            int wordLength = symbols.Length - 1;

            // Calculate the probability of each symbol being excluded from a given word
            double[] excludeDistribution = Array.ConvertAll(distribution, p => 1 - p * wordLength);

            // Create a random variable using the MathNet.Numerics library
            var randVar = new Categorical(excludeDistribution);
            var random = new Random();
            randVar.RandomSource = random;

            // We'll store all of the words in an array
            string[] words = new string[trials];

            for (int t = 0; t < trials; t++)
            {
                // Start with a word containing all of the symbols
                var word = new List<char>(symbols);

                // Remove one of the symbols
                word.RemoveAt(randVar.Sample());

                // Randomly permute the remainder (Fisher-Yates shuffle, so every order is equally likely)
                for (int i = wordLength - 1; i > 0; i--)
                {
                    int swapIndex = random.Next(i + 1);
                    char temp = word[swapIndex];
                    word[swapIndex] = word[i];
                    word[i] = temp;
                }

                // Store the word
                words[t] = new string(word.ToArray());
            }

            // Display words
            Array.ForEach(words, w => Console.WriteLine(w));

            // Display histogram
            Array.ForEach(symbols, s => Console.WriteLine("{0}: {1}", s, words.Count(w => w.Contains(s))));
        }

    }
}

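Update: here is a C implementation of the approach that rici outlined. The tricky part is calculating the thresholds he mentions, which I did with recursion.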

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// ****** Change the following for different symbol distributions, word lengths, and number of trials ******
double targetFreqs[] = {10.0/43, 9.0/43, 8.0/43, 7.0/43, 6.0/43, 2.0/43, 1.0/43 };
const int WORDLENGTH = 4;
const int TRIALS = 1000000;
// *********************************************************************************************************

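// An enum constant (unlike a const int) is a constant expression in C, so it can size the file-scope arrays below.
enum { SYMBOLCOUNT = sizeof(targetFreqs) / sizeof(targetFreqs[0]) };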
double inclusionProbs[SYMBOLCOUNT];

double probLeftToIncludeTable[SYMBOLCOUNT][SYMBOLCOUNT];

// Calculates the probability that there will be n symbols left to be included when we get to the ith symbol.
double probLeftToInclude(int i, int n)
{
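    // n == SYMBOLCOUNT can only be asked for when i == 0 and has no slot in the table,
    // so answer it directly rather than index out of bounds.
    if (n >= SYMBOLCOUNT)
        return (i == 0 && n == WORDLENGTH) ? 1.0 : 0.0;

    if (probLeftToIncludeTable[i][n] == -1)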
    {
        // If this is the first symbol, then the number of symbols left to be included is simply the word length.
        if (i == 0)
        {
            probLeftToIncludeTable[i][n] = (n == WORDLENGTH ? 1.0 : 0.0);
        }
        else
        {
            // Calculate the probability based on the previous symbol's probabilities.
            // To get to a point where there are n symbols left to be included, either there were n+1 symbols left
            // when we were considering that previous symbol and we included it, leaving n,
            // or there were n symbols left and we didn't include it, also leaving n.
            // We have to take into account that the previous symbol may have been mandatorily included.
             probLeftToIncludeTable[i][n] = probLeftToInclude(i-1, n+1) * (n == SYMBOLCOUNT-i ? 1.0 : inclusionProbs[i-1])
                + probLeftToInclude(i-1, n) * (n == 0 ? 1.0 : 1 - inclusionProbs[i-1]);
        }
    }
    return probLeftToIncludeTable[i][n];
}

// Calculates the probability that the ith symbol won't *have* to be included or *have* to be excluded.
double probInclusionIsOptional(int i)
{
    // The probability that inclusion is optional is equal to 1.0
    // minus the probability that none of the remaining symbols can be included
    // minus the probability that all of the remaining symbols must be included.
    return 1.0 - probLeftToInclude(i, 0) - probLeftToInclude(i, SYMBOLCOUNT - i);
}

// Calculates the probability with which the ith symbol should be included, assuming that
// it doesn't *have* to be included or *have* to be excluded.
double inclusionProb(int i)
{
    // The following is derived by simple algebra:
    // Unconditional probability = (1.0 * probability that it must be included) + (inclusionProb * probability that inclusion is optional)
    // therefore...
    // inclusionProb = (Unconditional probability - probability that it must be included) / probability that inclusion is optional
    return (targetFreqs[i]*WORDLENGTH - probLeftToInclude(i, SYMBOLCOUNT - i)) / probInclusionIsOptional(i);
}

int main(int argc, char* argv[])
{
    srand(time(NULL));

    // Initialize inclusion probabilities
    for (int i=0; i<SYMBOLCOUNT; i++)
        for (int j=0; j<SYMBOLCOUNT; j++)
            probLeftToIncludeTable[i][j] = -1.0;

    // Calculate inclusion probabilities
    for (int i=0; i<SYMBOLCOUNT; i++)
    {
        inclusionProbs[i] = inclusionProb(i);
    }

    // Histogram
    int histogram[SYMBOLCOUNT];
    for (int i=0; i<SYMBOLCOUNT; i++)
    {
        histogram[i] = 0;
    }

    // Scratchpad for building our words
    char word[WORDLENGTH+1];
    word[WORDLENGTH] = '\0';

    // Run trials
    for (int t=0; t<TRIALS; t++)
    {
        int included = 0;

        // Build the word by including or excluding symbols according to the problem constraints
        // and the probabilities in inclusionProbs[].
        for (int i=0; i<SYMBOLCOUNT && included<WORDLENGTH; i++)
        {
            if (SYMBOLCOUNT - i == WORDLENGTH - included // if we have to include this symbol
                || (double)rand()/(double)RAND_MAX < inclusionProbs[i]) // or if we get a lucky roll of the dice
            {
                word[included++] = 'a' + i;
                histogram[i]++;
            }
        }

        // Randomly permute the word (Fisher-Yates shuffle, so every order is equally likely)
        for (int i=WORDLENGTH-1; i>0; i--)
        {
            int swapee = rand() % (i+1);
            char temp = word[swapee];
            word[swapee] = word[i];
            word[i] = temp;
        }

        // Uncomment the following to show each word
        // printf("%s\r\n", word);
    }

    // Show the histogram
    for (int i=0; i<SYMBOLCOUNT; i++)
    {
        printf("%c: target=%d, actual=%d\r\n", 'a'+i, (int)(targetFreqs[i]*WORDLENGTH*TRIALS), histogram[i]);
    }

    return 0;
}

Answer 3

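I think this is a pretty interesting problem. I don't know the general solution, but it is easy enough to solve in the case of samples of size n-1 (if there is a solution), since there are exactly n possible samples, each of which corresponds to the absence of one of the elements.

Suppose we are looking for F_a = 9/30, F_b = 8/30, F_c = 7/30, F_d = 6/30 in samples of size 3 from a universe of size 4, as in the OP. We can translate each of those frequencies directly into a frequency of samples by selecting the sample that does not contain the given object. For example, we want 9/30 of the selected objects to be a; we cannot have more than one a in a sample, and we always have three symbols in a sample; consequently, 9/10 of the samples must contain a, and 1/10 cannot contain a. But there is only one possible sample that doesn't contain a: bcd. So 10% of the samples must be bcd. Similarly, 20% must be acd; 30% abd and 40% abc. (Or, more generally, F_ā = 1 - (n-1)F_a where F_ā is the frequency of the (unique) sample that does not include a.)

I can't help thinking that this observation, combined with one of the classic ways of generating unique samples, can solve the general problem. But I don't have that solution. For what it's worth, the algorithm I'm thinking of is the following: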
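The size n-1 case described here can be sketched in a few lines of Python: compute each symbol's exclusion probability 1 - (n-1)F, draw the one symbol to leave out, and shuffle the rest. The function name `sample_all_but_one` and the `rng` parameter are made up for this illustration.

```python
import random

def sample_all_but_one(symbols, freqs, rng=random):
    """Return a word of len(symbols) - 1 distinct symbols with the given symbol frequencies."""
    n = len(symbols)
    # Probability that each symbol is the one left out: 1 - (n-1) * F.
    exclude_weights = [1.0 - (n - 1) * f for f in freqs]
    if min(exclude_weights) < -1e-12:
        raise ValueError("a frequency exceeds 1/(n-1); no solution exists")
    exclude_weights = [max(w, 0.0) for w in exclude_weights]  # clip rounding noise
    left_out = rng.choices(range(n), weights=exclude_weights, k=1)[0]
    word = [s for i, s in enumerate(symbols) if i != left_out]
    rng.shuffle(word)  # the order of the kept symbols carries no information
    return "".join(word)

print(sample_all_but_one("abcd", [9/30, 8/30, 7/30, 6/30]))
```

For the OP's numbers the exclusion weights come out as 0.1, 0.2, 0.3 and 0.4, matching the 10% bcd, 20% acd, 30% abd and 40% abc split worked out above.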

To select a random sample of size k out of a universe U of n objects:
1) Set needed = k; available = n.
2) For each element in U, select a random number in the range [0, 1).
3) If the random number is less than needed/available:
     3a) Add the element to the sample.
     3b) Decrement needed by 1. If it reaches 0, we're finished.
4) Decrement available, and continue with the next element in U.
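A minimal Python sketch of these steps, for the plain uniform case. It uses the running ratio needed/available as the step 3 threshold, which is what guarantees exactly k elements; the function name `selection_sample` and the `rng` parameter are assumptions for illustration.

```python
import random

def selection_sample(universe, k, rng=random):
    """Pick k distinct elements uniformly at random, keeping their original order."""
    needed = k
    available = len(universe)
    sample = []
    for element in universe:
        if needed == 0:
            break  # sample is complete
        # Threshold: how many we still need over how many are still on offer.
        if rng.random() < needed / available:
            sample.append(element)
            needed -= 1
        available -= 1
    return sample

print(selection_sample(list("abcd"), 3))
```

When needed equals available the threshold is 1, so every remaining element is taken, which is the "mandatory inclusion" case a non-uniform variant also has to handle.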

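So my thought is that it should be possible to manipulate the frequency of each element by changing the threshold in step 3, making it somehow a function of the desired frequency of the corresponding element.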
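For comparison, a different route to the general case is systematic sampling, a known technique from survey sampling. To be clear, this is not the threshold idea described here and it does not come from this thread. It is a sketch under the assumption that only the per-symbol frequencies matter; the name `systematic_sample` is made up. Each symbol gets an interval of length k*p on the line [0, k), and the k points u, u+1, ..., u+k-1 (with u uniform in [0, 1)) pick the symbols whose intervals they land in.

```python
import random
from itertools import accumulate

def systematic_sample(symbols, probs, k, rng=random):
    """Return k distinct symbols such that symbol i is included with probability k * probs[i]."""
    incl = [k * p for p in probs]  # inclusion probability of each symbol
    if abs(sum(probs) - 1.0) > 1e-9 or any(pi > 1.0 + 1e-9 for pi in incl):
        raise ValueError("probabilities must sum to 1 and none may exceed 1/k")
    cum = list(accumulate(incl))   # right end of each symbol's interval
    cum[-1] = float(k)             # guard against rounding in the last sum
    u = rng.random()
    chosen = []
    j = 0                          # the next point to place is u + j
    for sym, upper in zip(symbols, cum):
        # No interval is longer than 1, so it can catch at most one point.
        if j < k and u + j < upper:
            chosen.append(sym)
            j += 1
    rng.shuffle(chosen)
    return "".join(chosen)

print(systematic_sample("abcd", [9/30, 8/30, 7/30, 6/30], 3))
```

One caveat: with a fixed symbol order, some combinations of symbols can never occur together, so this only controls the individual frequencies, not the joint distribution of the samples.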

Answer 4

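To do this you have to use a temporary array that stores the cumulative sums of your probabilities.

In your example, the probabilities are 9/30, 8/30, 7/30 and 6/30 respectively. You should then have the arrays:

char values[] = {'a', 'b', 'c', 'd'};
double proba[] = {9.0/30, 17.0/30, 24.0/30, 1.0};

Then you pick a random number r in [0, 1] and do the following: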

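// r is a uniform random number in [0, 1], drawn by the caller
char chooseRandom(double r) {
    int i = 0;
    // the last entry of proba is 1, so the loop always stops inside the array
    while (r > proba[i])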
        ++i; 

    return values[i];
}
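The same cumulative-sum lookup can be sketched in Python with a binary search instead of a linear scan. The names `choose_random`, `weights` and `cumulative` are made up for illustration. Note that this draws one symbol at a time, so on its own it does not guarantee three distinct symbols per sample.

```python
import bisect
import random
from itertools import accumulate

values = ["a", "b", "c", "d"]
weights = [9, 8, 7, 6]                  # numerators over 30
cumulative = list(accumulate(weights))  # running totals: 9, 17, 24, 30

def choose_random(rng=random):
    """Draw one symbol with probability proportional to its weight."""
    r = rng.random() * cumulative[-1]   # uniform over the total weight
    index = bisect.bisect_right(cumulative, r)
    return values[min(index, len(values) - 1)]  # clamp in case of rounding

print("".join(choose_random() for _ in range(10)))
```

Working with integer numerators keeps the cumulative table exact; only the final scaling of r involves floating point.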

Random sample of values with specified resulting probabilities
