Question

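I would like to be able to generate random numbers whose probability density function comes from a drawn curve. The two curves below have the same area under them but should produce lists of random numbers with different characteristics.

My intuition is that one way to do this is to sample the curve, use the areas of the resulting rectangles as weights for np.random.choice to pick a range, and then draw an ordinary uniform random number within that range.

This does not feel like a very efficient way to do it. Is there a more "correct" way?

I had a go at actually implementing it: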

import matplotlib.pyplot as plt
import numpy as np

areas = [4.397498, 4.417111, 4.538467, 4.735034, 4.990129, 5.292455, 5.633938,
         6.008574, 6.41175, 5.888393, 2.861898, 2.347887, 2.459234, 2.494357,
         2.502986, 2.511614, 2.520243, 2.528872, 2.537501, 2.546129, 7.223747,
         7.223747, 2.448148, 1.978746, 1.750221, 1.659351, 1.669999]
# Bin edges: areas[i] is the area under the curve between divisions[i] and divisions[i+1].
divisions = [0.0, 0.037037, 0.074074, 0.111111, 0.148148, 0.185185, 0.222222,
             0.259259, 0.296296, 0.333333, 0.37037, 0.407407, 0.444444, 0.481481,
             0.518519, 0.555556, 0.592593, 0.62963, 0.666667, 0.703704, 0.740741,
             0.777778, 0.814815, 0.851852, 0.888889, 0.925926, 0.962963, 1.0]
# Normalize the areas into probabilities, pick a bin for each sample,
# then draw uniformly inside the chosen bin.
total_area = sum(areas)
weights = [a/total_area for a in areas]
indexes = np.random.choice(range(len(areas)), 50000, p=weights)
samples = []
for i in indexes:
    samples.append(np.random.uniform(divisions[i], divisions[i+1]))

binwidth = 0.02
binSize = np.arange(min(samples), max(samples) + binwidth, binwidth)
plt.hist(samples, bins=binSize)
plt.xlim(xmax=1)
plt.show()

The approach seems to work, but it is a bit heavyweight!

Answer 1

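For your case, a histogram-based approach certainly seems the simplest, since you have a user-drawn line.

However, since you are only trying to generate random numbers from that distribution, you can use the normalized y values (each pixel's y position divided by the sum of all the y positions) directly as probability_distribution in the function below, and simply pass arrays whose size is the number of pixels the user has drawn.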

from numpy.random import choice
pde = choice(list_of_candidates, number_of_items_to_pick, p=probability_distribution)

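probability_distribution (the normalized pixel y values) is a sequence in the same order as list_of_candidates (the associated x values). You can also use the keyword replace=False to change the behavior so that drawn items are not replaced.

See the NumPy documentation for numpy.random.choice.

This should be much faster, since you are not actually generating the whole PDF, only drawing random numbers that match it.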
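As a minimal sketch of that idea, assume the drawn curve is available as two arrays. The names xs and ys and the example curve are made up for illustration; they stand in for the pixel x positions and the curve height at each pixel.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

# Hypothetical stand-in for a user-drawn curve: one height per pixel column.
xs = np.linspace(0.0, 1.0, 200)
ys = 1.0 + 0.5 * np.sin(2 * np.pi * xs)  # any non-negative heights work

# Normalize the heights so they sum to 1 and can be used as probabilities.
probability_distribution = ys / ys.sum()

# Draw x values with probability proportional to the curve height.
draws = rng.choice(xs, size=50_000, p=probability_distribution)
```

Note that the draws can only take values that appear in xs, so the result is discrete at pixel resolution. If that matters, a small uniform offset within each pixel can be added afterwards.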

Edit: your update looks like a solid approach. If you want to generate the PDF, you might look into numba (http://numba.pydata.org) to speed up your for loop.
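The loop in the question can also be removed with plain NumPy array operations instead of numba. The following is only a sketch of that alternative; the function name sample_piecewise_uniform and the toy inputs are assumptions for illustration.

```python
import numpy as np

def sample_piecewise_uniform(areas, edges, n, rng=None):
    """Pick a bin with probability proportional to its area, then draw
    uniformly inside that bin. All n samples are produced at once."""
    rng = np.random.default_rng() if rng is None else rng
    areas = np.asarray(areas, dtype=float)
    edges = np.asarray(edges, dtype=float)  # len(edges) == len(areas) + 1
    weights = areas / areas.sum()
    idx = rng.choice(len(areas), size=n, p=weights)
    # uniform() broadcasts over arrays of lower and upper bounds.
    return rng.uniform(edges[idx], edges[idx + 1])

# Toy input: the second bin carries three times the area of the first.
rng = np.random.default_rng(0)
samples = sample_piecewise_uniform([1.0, 3.0], [0.0, 0.5, 1.0], 100_000, rng)
```

With the question's data, areas and the list of bin edges can be passed in directly.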

Answer 2

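One approach is to use rv_continuous from scipy.stats. A straightforward way to start is to approximate one of the PDFs with a collection of splines and feed that to rv_continuous. In fact, you can generate pseudorandom deviates by defining either a pdf or a cdf with this class.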
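For a piecewise-constant curve like the one in the question, scipy.stats also provides rv_histogram, a ready-made rv_continuous subclass built from bin heights and bin edges. The sketch below uses it in place of the spline approximation described above, so it is a simpler stand-in rather than the same method; the toy heights are made up.

```python
import numpy as np
from scipy import stats

# Toy piecewise-constant density: relative heights on equal-width bins.
heights = np.array([1.0, 3.0, 2.0, 2.0])
edges = np.linspace(0.0, 1.0, len(heights) + 1)

# rv_histogram normalizes the heights and exposes pdf, cdf, ppf and rvs.
dist = stats.rv_histogram((heights, edges))

rng = np.random.default_rng(0)
samples = dist.rvs(size=100_000, random_state=rng)
```

The resulting object behaves like any other scipy.stats distribution, so dist.pdf and dist.cdf can be used to check the fit against the drawn curve.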

Answer 3

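Another approach is to sample the inverse of the CDF. You use a uniform random generator to produce p values, treat them as positions along the x axis of the inverse CDF, and read off the corresponding values, which are random draws from the PDF. See this article: http://matlabtricks.com/post-44/generate-random-numbers-with-a-given-distribution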
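A minimal sketch of inverse-CDF sampling for a tabulated curve follows. It assumes the curve is given as grid points x with non-negative heights pdf_values, builds the CDF with the trapezoid rule, and inverts it by linear interpolation, so the result is approximate. The function name sample_inverse_cdf is hypothetical.

```python
import numpy as np

def sample_inverse_cdf(x, pdf_values, n, rng=None):
    """Draw n samples from a tabulated, possibly unnormalized, density."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    pdf_values = np.asarray(pdf_values, dtype=float)
    # Trapezoid-rule area of each interval, accumulated into a CDF on the grid.
    areas = 0.5 * (pdf_values[1:] + pdf_values[:-1]) * np.diff(x)
    cdf = np.concatenate(([0.0], np.cumsum(areas)))
    cdf /= cdf[-1]  # normalize so the CDF ends at exactly 1
    u = rng.random(n)  # uniform p values in [0, 1)
    # Inverting the CDF amounts to interpolating with the axes swapped.
    return np.interp(u, cdf, x)

# Example: the density 2x on [0, 1], whose mean is 2/3.
x = np.linspace(0.0, 1.0, 1001)
samples = sample_inverse_cdf(x, 2.0 * x, 200_000, np.random.default_rng(0))
```

The interpolation assumes the CDF is increasing; long stretches where the density is exactly zero make the inverse ambiguous there.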

Answer 4

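I had trouble using rv_continuous, so I wrote my own small routine to sample from any continuous distribution with compact support, e.g. from a sum of two exponentials, or from any known discrete pdf (as in the question). This is essentially @Jan's solution (a pretty classical one).

My code is fully self-contained. To adapt it to any other distribution, you only need to change the formula in unnromalized_pdf and make sure the support range is set correctly (in my case, from 0 to 10 times the larger of 1/lambda1 and 1/lambda2 is enough).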

import numpy as np
import matplotlib.pyplot as plt
plt.ion()

## the function may be any function, so long as it has
## FINITE support
def unnromalized_pdf(T, lambda1, intercept1, lambda2, intercept2):
    return (np.exp(-lambda1*T-intercept1) + np.exp(-lambda2*T-intercept2))
lambda1, intercept1, lambda2, intercept2 = 0.0012941708402716523, 8.435217547457713, 0.0063804460354380385, 6.712937938322769

## defining the support of the pdf by hand
x0 = 0
xmax = max(1/lambda1, 1/lambda2)*10

## the more bins, the higher the precision
Nbins = 1000000
xs = np.linspace(x0, xmax, Nbins)
dx = xs[1]-xs[0]
## other way to specify it:
# dx = min(1/lambda1, 1/lambda2)/100
# xs = np.arange(x0, xmax, dx)

## compute the (approximate) pdf and cdf of the thing to sample:
pdf = unnromalized_pdf(xs, lambda1, intercept1, lambda2, intercept2)
normalized_pdf = pdf/pdf.sum()
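cdf = np.cumsum(normalized_pdf)
cdf /= cdf[-1]  ## force the last value to exactly 1 so searchsorted never returns an out-of-range index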

## sampling from the distro
Nsamples=100000
r = np.random.random(Nsamples)
indices_in_cdf = np.searchsorted(cdf, r)
values_drawn = xs[indices_in_cdf]
histo, bins = np.histogram(values_drawn, 1000, density=True)
plt.semilogy(bins[:-1], histo, label='drawn from distro', color='blue')
plt.semilogy(xs, normalized_pdf/dx, label ="exact pdf from which we sample", color='k', lw=3)
plt.legend()

Generating random numbers from an arbitrary probability density function
