First use of PyMC fails

Date: 2016-03-31 14:44:56

Tags: pymc

I am new to PyMC and am wondering why this code does not work. I have spent hours on it, but I am missing something. Can anyone help me?

The problem I want to solve:

  • I have a set of Npts measurements showing 3 bumps, so I want to model them as a sum of 3 Gaussians (assuming the measurements are numerous and the Gaussian approximation holds) ==> I want to estimate 8 parameters: the relative weights of the bumps (i.e. 2 parameters), their 3 means and their 3 variances.

  • I want the approach to be general enough to apply to other setups that may not show the same bumps, so I use wide flat priors.

Question: my code below gives me poor estimates. What is wrong? Thanks!

"""
hypothesis: multimodal distrib sum of 3 gaussian distributions

model description:
* p1, p2, p3 are the probabilities for a point to belong to gaussian 1, 2 or 3
 ==> p1, p2, p3 are the relative weights of the 3 gaussians

* once a point is associated with a gaussian,
it is distributed normally according to the parameters mu_i, sigma_i of the gaussian
but instead of considering sigma, pymc prefers considering tau=1/sigma**2

* thus, PyMc must guess 8 parameters: p1, p2, mu1, mu2, mu3, tau1, tau2, tau3

* priors on p1, p2 are flat between 0.1 and 0.9 ==> 'pm.Uniform' variables
with the constraint p2<=1-p1. p3 is deterministic ==1-p1-p2

* the 'assignment' variable assigns each point to a gaussian, according to probabilities p1, p2, p3

* priors on mu1, mu2, mu3 are flat between 40 and 120 ==> 'pm.Uniform' variables

* priors on sigma1, sigma2, sigma3 are flat between 4 and 12 ==> 'pm.Uniform' variables
"""

    import numpy as np
    import pymc as pm

    data = np.loadtxt('distrib.txt')
    Npts = len(data)

    mumin = 40
    mumax = 120
    sigmamin=4
    sigmamax=12

    p1 = pm.Uniform("p1",0.1,0.9)
    p2 = pm.Uniform("p2",0.1,1-p1)
    p3 = 1-p1-p2
    assignment = pm.Categorical('assignment',[p1,p2,p3],size=Npts)
    mu = pm.Uniform('mu',[mumin,mumin,mumin],[mumax,mumax,mumax])
    sigma = pm.Uniform('sigma',[sigmamin,sigmamin,sigmamin],
                       [sigmamax,sigmamax,sigmamax])
    tau = 1/sigma**2

    @pm.deterministic
    def assign_mu(assi=assignment,mu=mu):
        return mu[assi]

    @pm.deterministic
    def assign_tau(assi=assignment,sig=tau):
        return sig[assi]

    hypothesis = pm.Normal("obs", assign_mu, assign_tau, value=data, observed=True)

    model = pm.Model([hypothesis, p1, p2, tau, mu])
    test = pm.MCMC(model)
    test.sample(50000,burn=20000) # conservative values, let's take a coffee... 

    print('\nguess\n* p1, p2 = ',
           np.mean(test.trace('p1')[:]),' ; ',
           np.mean(test.trace('p2')[:]),' ==> p3 = ',
           1-np.mean(test.trace('p1')[:])-np.mean(test.trace('p2')[:]),
           '\n* mu = ',
           np.mean(test.trace('mu')[:,0]),' ; ',
           np.mean(test.trace('mu')[:,1]),' ; ',
           np.mean(test.trace('mu')[:,2]))

    print('why does this guess suck ???!!!')      

I can send the data file 'distrib.txt'. It is ~500 kB, and the data are as shown below. For example, the last run gave me:

p1, p2 = 0.366913192214  ;  0.583816452532  ==> p3 = 0.04927035525400003
mu =  77.541619286  ;  75.3371615466  ;  77.2427165073

while there are clear bumps near ~55, ~75 and ~90, with probabilities of roughly ~0.2, ~0.5 and ~0.3.

[histogram of the data (20k points == Npts)]

1 Answer:

Answer 0 (score: 1)

You are running into the problem described here: Negative Binomial Mixture in PyMC

The problem is that the categorical assignment variable converges too slowly for the three component distributions to separate.

First, we generate your test data:

    data1 = np.random.normal(55,5,2000)
    data2 = np.random.normal(75,5,5000)
    data3 = np.random.normal(90,5,3000)
    data = np.concatenate([data1, data2, data3])
    np.savetxt("distrib.txt", data)

Then we plot the histogram, colored by the posterior component assignment:

    tablebyassignment = [data[np.nonzero(np.round(test.trace("assignment")[:].mean(axis=0)) == i)]
                         for i in range(0, 3)]
    plt.hist(tablebyassignment, bins=30, stacked=True)

[Stacked histogram with poor discrimination between clusters]

This will eventually converge, but not well enough to be useful to you. You can fix the problem by guessing the assignments before starting MCMC, for example by initializing the 'assignment' variable from a k-means clustering of the data.

Which gives you: [Stacked histogram showing separation of three lumps]

Using k-means to initialize the categorical variable may not work every time, but it is better than not converging.
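The answer does not show the initialization code itself. A minimal numpy-only sketch of the idea (the `kmeans_1d` helper and its quantile-based seeding are assumptions for illustration, not the answer's actual code) could look like this:

```python
import numpy as np

def kmeans_1d(data, k, n_iter=50):
    """Minimal 1-D Lloyd's k-means; returns (labels, sorted centers)."""
    # Deterministic seeding: spread the initial centers over the data's quantiles.
    centers = np.quantile(data, (np.arange(k) + 0.5) / k)
    for _ in range(n_iter):
        labels = np.argmin(np.abs(data[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = data[labels == j].mean()
        centers.sort()
    return labels, centers

# Synthetic data with the same three bumps as the answer's example:
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(55, 5, 2000),
                       rng.normal(75, 5, 5000),
                       rng.normal(90, 5, 3000)])
labels, centers = kmeans_1d(data, 3)
# centers lands near 55, 75 and 90; in PyMC 2 the labels could then seed the
# stochastic via its initial value, e.g. (hypothetical, following the question's names):
#   assignment = pm.Categorical('assignment', [p1, p2, p3],
#                               size=Npts, value=labels)
```

Starting the sampler with each point already assigned to the right bump means the Categorical only has to fine-tune boundary points instead of discovering the clusters from scratch.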