Question

How can I create N "random" strings of length K using a probability table? K will be an even number.

prob_table = {'aa': 0.2, 'ab': 0.3, 'ac': 0.5}

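Say K = 6; then 'acacab' should be more likely than 'aaaaaa'.

This is a sub-problem of a larger problem in which I generate synthetic sequences based on a probability table. I'm not sure how to use the probability table to generate the "random" strings.

What I have so far: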

def seq_prob(fprob_table, K=6, N=10):
    #fprob_table is the probability dictionary that you input
    #K is the length of the sequence
    #N is the amount of sequences
    seq_list = []
    #possibly using itertools or random to generate the semi-"random" strings based on the probabilities 
    return seq_list

Answer 1

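There are some good recipes for weighted random choice described at the end of the documentation for the builtin random module:

A common task is to make a random.choice() with weighted probabilities.

If the weights are small integer ratios, a simple technique is to build a sample population with repeats: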

>>> weighted_choices = [('Red', 3), ('Blue', 2), ('Yellow', 1), ('Green', 4)]
>>> population = [val for val, cnt in weighted_choices for i in range(cnt)]
>>> random.choice(population)
'Green'

A more general approach is to arrange the weights in a cumulative distribution with itertools.accumulate(), and then locate the random value with bisect.bisect():

>>> choices, weights = zip(*weighted_choices)
>>> cumdist = list(itertools.accumulate(weights))
>>> x = random.random() * cumdist[-1]
>>> choices[bisect.bisect(cumdist, x)]
'Blue'

To adapt the latter approach to your specific problem, I would do something like this:

import random
import itertools
import bisect

def seq_prob(fprob_table, K=6, N=10):
    # Split the (key, probability) pairs into two parallel tuples
    choices, weights = zip(*fprob_table.items())
    cumdist = list(itertools.accumulate(weights))

    results = []
    for _ in range(N):
        s = ""
        while len(s) < K:
            x = random.random() * cumdist[-1]
            s += choices[bisect.bisect(cumdist, x)]
        results.append(s)

    return results

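This assumes that the key strings in the probability table all have the same length. If they have several different lengths, this code will sometimes (perhaps most of the time!) give answers longer than K characters. I suppose it also assumes that K is an exact multiple of the key length, though it will still run if that is not true (it will just return strings longer than K characters, since there is no way to get exactly K).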
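
On Python 3.6 and later, random.choices accepts a weights argument and handles the cumulative distribution internally. The following is a minimal sketch rather than the answer's original code; the name seq_prob_choices and the explicit length checks are illustrative assumptions that turn the caveats above into errors instead of over-long strings.

```python
import random


def seq_prob_choices(fprob_table, K=6, N=10):
    """Return N strings of length K built from weighted picks of the table keys.

    Hypothetical helper: assumes all keys share one length and that K is a
    multiple of that length; otherwise it raises ValueError.
    """
    keys = list(fprob_table)
    weights = [fprob_table[key] for key in keys]

    # Reject tables whose keys differ in length (or are empty strings).
    key_lengths = {len(key) for key in keys}
    if len(key_lengths) != 1 or 0 in key_lengths:
        raise ValueError("all keys must share one non-zero length")
    key_len = key_lengths.pop()
    if K % key_len != 0:
        raise ValueError("K must be a multiple of the key length")

    picks_per_string = K // key_len
    return [
        "".join(random.choices(keys, weights=weights, k=picks_per_string))
        for _ in range(N)
    ]


prob_table = {'aa': 0.2, 'ab': 0.3, 'ac': 0.5}
sequences = seq_prob_choices(prob_table, K=6, N=10)
```

The weights do not need to sum to 1 here, since random.choices treats them as relative weights.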

Answer 2

You can use random.random:

from random import random
def seq_prob(fprob_table, K=6, N=10):
    #fprob_table is the probability dictionary that you input
    #K is the length of the sequence
    #N is the amount of sequences
    seq_list = []
    s = ""
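    while len(seq_list) < N:
        # One pass over the table: each key is appended independently
        # with probability v, in the dict's iteration order.
        for k, v in fprob_table.items():
            if len(s) == K:
                # The string is complete; store it and start a new one.
                seq_list.append(s)
                s = ""
                break
            rn = random()
            if rn <= v:
                s += k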
    return seq_list

No doubt this can be improved, but random.random is useful when dealing with probabilities.
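
In the loop above each key is accepted or skipped independently, so successive pairs are not independent draws from the table. One way to get a single weighted pick per step while still using only random.random is to draw one number and walk the running total of the probabilities. This is a sketch under that assumption; weighted_pick and seq_prob_walk are hypothetical names, and all keys are assumed to have the same length, as in the question.

```python
from random import random


def weighted_pick(fprob_table):
    """Pick one key with probability proportional to its table value."""
    total = sum(fprob_table.values())
    threshold = random() * total  # a single draw decides the pick
    running = 0.0
    for key, prob in fprob_table.items():
        running += prob
        if threshold < running:
            return key
    return key  # guard against floating-point round-off on the last key


def seq_prob_walk(fprob_table, K=6, N=10):
    """Build N strings of length K from independent weighted picks."""
    key_len = len(next(iter(fprob_table)))  # assumes equal-length keys
    picks_per_string = K // key_len
    return [
        "".join(weighted_pick(fprob_table) for _ in range(picks_per_string))
        for _ in range(N)
    ]


prob_table = {'aa': 0.2, 'ab': 0.3, 'ac': 0.5}
walk_sequences = seq_prob_walk(prob_table, K=6, N=10)
```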

Answer 3

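I'm sure there is a cleaner/better way, but here is a simple one.

Here we fill pick_list with 100 individual character-pair values, where the number of copies of each value is determined by its probability. In this case pick_list holds 20 'aa', 30 'ab' and 50 'ac' entries. random.choice(pick_list) then draws one entry from the list uniformly at random.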

import random

prob_table = {'aa': 0.2, 'ab': 0.3, 'ac': 0.5}


def seq_prob(fprob_table, K=6, N=10):
    #fprob_table is the probability dictionary that you input

    # fill list with number of items based on the probabilities
    pick_list = []
    for key, prob in fprob_table.items():
        # round() avoids losing an entry to float error (e.g. 0.29 * 100 is just below 29)
        pick_list.extend([key] * round(prob * 100))

    #K is the length of the sequence
    #N is the amount of sequences
    seq_list = []
    for i in range(N):
        # each pick adds 2 characters, so K // 2 picks give a string of length K
        sub_seq = "".join(random.choice(pick_list) for _ in range(K // 2))
        seq_list.append(sub_seq)
    return seq_list

Result:

 seq_prob(prob_table)
['ababac',
 'aaacab',
 'aaaaac',
 'acacac',
 'abacac',
 'acaaac',
 'abaaab',
 'abaaab',
 'aaabaa',
 'aaabaa']
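
A quick way to sanity-check any of these generators is to split a large batch of output back into pairs and compare the observed frequencies with the table. The sketch below is an illustration only; pair_frequencies is a hypothetical helper, and random.choices serves as a stand-in generator so that the example runs on its own.

```python
import random
from collections import Counter


def pair_frequencies(sequences, pair_len=2):
    """Return the relative frequency of each fixed-length chunk in sequences."""
    counts = Counter(
        seq[i:i + pair_len]
        for seq in sequences
        for i in range(0, len(seq), pair_len)
    )
    total = sum(counts.values())
    return {pair: count / total for pair, count in counts.items()}


prob_table = {'aa': 0.2, 'ab': 0.3, 'ac': 0.5}
keys = list(prob_table)
weights = list(prob_table.values())

# Stand-in generator: 20000 strings of three weighted pairs each.
sample = ["".join(random.choices(keys, weights=weights, k=3)) for _ in range(20000)]
freqs = pair_frequencies(sample)

for pair in keys:
    print(pair, "expected", prob_table[pair], "observed", round(freqs[pair], 3))
```

With a sample this large, the observed values should land close to 0.2, 0.3 and 0.5; a generator whose numbers drift far from the table is worth a second look.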

Answer 4

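If your table or your sequences are large, using numpy may help, since it will probably be noticeably faster. Numpy is also built for this kind of problem, and the approach is easy to understand and takes only 3 or 4 lines.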

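The idea is to convert the probabilities into cumulative probabilities, i.e. to map (.2, .5, .3) to (.2, .7, 1.). Random numbers generated from the flat distribution between 0 and 1 then fall into the bins of the cumulative sum with frequencies that correspond to the weights. Numpy's searchsorted can be used to quickly find the bin each random value lies in. That is,

import numpy as np

prob_table = {'aa': 0.2, 'ab': 0.3, 'ac': 0.5}
N = 10
k = 3   # number of strings per sequence (not number of characters)

rvals = np.random.random((N, k))   # generate a bunch of random values
# list(...) is needed on Python 3, where dict views are not sequences
cum_probs = np.cumsum(list(prob_table.values()))
string_indices = np.searchsorted(cum_probs, rvals)   # weighted indices
x = np.array(list(prob_table.keys()))[string_indices]   # get the strings associated with the indices
y = ["".join(x[i, :]) for i in range(x.shape[0])]   # convert this to a list of strings

# y = ['acabab', 'acacab', 'acabac', 'aaacaa', 'acabac', 'acacab', 'acabaa', 'aaabab', 'abacac', 'aaabab']

Here I used k as the number of strings you need, rather than K as the number of characters, since the problem statement is ambiguous about strings versus characters.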
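
Newer NumPy versions also offer a direct weighted draw through numpy.random.Generator.choice and its p argument, which avoids handling the cumulative sum by hand. The following is a minimal sketch, assuming the probabilities sum to 1 and using k for the number of pairs per string as above; the name seq_prob_numpy and the seed parameter are illustrative, not part of the original answer.

```python
import numpy as np


def seq_prob_numpy(fprob_table, k=3, N=10, seed=None):
    """Return N strings, each the concatenation of k weighted picks."""
    rng = np.random.default_rng(seed)
    keys = np.array(list(fprob_table.keys()))
    probs = np.array(list(fprob_table.values()), dtype=float)

    # Draw an (N, k) grid of keys; p must sum to 1 or choice raises ValueError.
    picks = rng.choice(keys, size=(N, k), p=probs)
    return ["".join(row) for row in picks]


prob_table = {'aa': 0.2, 'ab': 0.3, 'ac': 0.5}
numpy_sequences = seq_prob_numpy(prob_table, k=3, N=10, seed=0)
```

Passing a seed makes the output reproducible, which helps when comparing runs; leaving it as None gives fresh randomness each time.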
