Scipy rv_continuous incorrectly generating samples from a distribution

Time: 2018-06-14 12:46:05

Tags: python random scipy statistics

from scipy import stats
import numpy as np 

class your_distribution(stats.rv_continuous):
    def _pdf(self, x):
        p0 = 10.9949
        p1 = 0.394447
        p2 = 12818.4
        p3 = 2.38898

        return ((p1*p3)/(p3*p0+p2*p1))*((p0*np.exp(-1.0*p1*x))+(p2*np.exp(-1.0*p3*x)))

distribution = your_distribution(a=0.15, b=10.1)
sample = distribution.rvs(size=50000)

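The code above generates 50,000 samples from a normalized PDF over the range 0.15 to 10.1. However, a disproportionate number of samples is generated at the upper bound b=10.1. This makes no sense, as can be seen when the PDF is plotted.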

How can I fix this?

1 Answer:

Answer 0 (score: 2)

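The PDF is correctly normalized over the full range of the distribution. However, setting a and b simply clips the PDF without any renormalization. With (a=0.15, b=10.1) the PDF no longer integrates to 1, and, by a quirk of the scipy implementation, the remaining density is apparently added at the end of the range. This results in a large number of samples at the upper bound.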
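
As a quick check of the missing mass, here is a minimal sketch that integrates the question's PDF over the clipped range with scipy.integrate.quad. The standalone helper pdf and the variable names are illustrative and not part of the original code.

```python
import numpy as np
from scipy.integrate import quad

def pdf(x):
    # Same density as in the question, normalized over [0, inf).
    p0, p1, p2, p3 = 10.9949, 0.394447, 12818.4, 2.38898
    norm = (p1 * p3) / (p3 * p0 + p2 * p1)
    return norm * (p0 * np.exp(-p1 * x) + p2 * np.exp(-p3 * x))

full_mass, _ = quad(pdf, 0, np.inf)       # should be close to 1
clipped_mass, _ = quad(pdf, 0.15, 10.1)   # mass that survives clipping to a..b

print(f"mass on [0, inf):     {full_mass:.4f}")
print(f"mass on [0.15, 10.1]: {clipped_mass:.4f}")
print(f"missing mass:         {1 - clipped_mass:.4f}")
```

With these parameters roughly 30% of the mass lies outside a..b, which is consistent with the large share of samples landing at the upper bound.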

We can visualize what is happening by plotting the cumulative distribution function (CDF) for a=0 and a=0.15:

import matplotlib.pyplot as plt

# Uses np and your_distribution as defined in the question.
x = np.linspace(0, 15, 1000)

distribution = your_distribution(a=0.0, b=10.1)
plt.plot(x, distribution.cdf(x), label='a=0')

distribution = your_distribution(a=0.15, b=10.1)
plt.plot(x, distribution.cdf(x), label='a=0.15')

plt.legend()
plt.show()

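To get rid of the jump in the CDF and the spurious samples at the upper end of the range, we need to renormalize the PDF over the range a..b. I'm too lazy to work out the correct factor analytically, so let's make scipy do the hard work: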

from scipy import stats
from scipy.integrate import quad
import numpy as np

# I pulled the definition of the PDF out of the class so we can use it to
# compute the scale factor.
def pdf(x):
    p0 = 10.9949
    p1 = 0.394447
    p2 = 12818.4
    p3 = 2.38898

    return ((p1*p3)/(p3*p0+p2*p1))*((p0*np.exp(-1.0*p1*x))+(p2*np.exp(-1.0*p3*x)))    

class your_distribution(stats.rv_continuous):        
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

        # integrate area of the PDF in range a..b
        self.scale, _ = quad(pdf, self.a, self.b)

    def _pdf(self, x):
        return pdf(x) / self.scale  # scale PDF so that it integrates to 1 in range a..b 

distribution = your_distribution(a=0.15, b=10.1)
sample = distribution.rvs(size=1000)

If you happen to know an analytic solution for the integral, you can use it instead of calling quad.
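
For this particular PDF the integral has a simple closed form, because each term is a plain exponential. The sketch below is one way to write it; the function name pdf_mass and the module-level constants are hypothetical names introduced here, not part of the original answer.

```python
import math

# Parameters from the question.
P0, P1, P2, P3 = 10.9949, 0.394447, 12818.4, 2.38898
NORM = (P1 * P3) / (P3 * P0 + P2 * P1)

def pdf_mass(a, b):
    """Closed-form integral of the question's PDF over [a, b]."""
    # Integral of p * exp(-k * x) from a to b is (p / k) * (exp(-k*a) - exp(-k*b)).
    term1 = (P0 / P1) * (math.exp(-P1 * a) - math.exp(-P1 * b))
    term2 = (P2 / P3) * (math.exp(-P3 * a) - math.exp(-P3 * b))
    return NORM * (term1 + term2)

scale = pdf_mass(0.15, 10.1)
print(scale)
```

In the class above, assigning self.scale = pdf_mass(self.a, self.b) could then stand in for the quad call.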