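I am trying to fit some data to a lognormal distribution and then use the fitted parameters to generate random lognormal samples. After some searching I found a couple of solutions, but neither is convincing:
Solution 1, using the fit function: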
import numpy as np
from scipy.stats import lognorm
mydata = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,5,6,6,6,6,6,7,7,7,8,8,8,8,8,9,9,9,10,10,11,12,13,14,14,15,19,19,21,23,25,27,28,30,31,36,41,45,48,52,55,60,68,75,86,118,159,207,354]
shape, loc, scale = lognorm.fit(mydata)
rnd_log = lognorm.rvs(shape, loc=loc, scale=scale, size=100)
Solution 2, using mu and sigma computed from the original data:
import numpy as np
from scipy.stats import lognorm
mydata = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,5,6,6,6,6,6,7,7,7,8,8,8,8,8,9,9,9,10,10,11,12,13,14,14,15,19,19,21,23,25,27,28,30,31,36,41,45,48,52,55,60,68,75,86,118,159,207,354]
mu = np.mean([np.log(i) for i in mydata])
sigma = np.std([np.log(i) for i in mydata])
distr = lognorm(mu, sigma)
rnd_log = distr.rvs(size=100)
Neither of these solutions fits the data well:
import pylab
pylab.plot(sorted(mydata, reverse=True), 'ro')
pylab.plot(sorted(rnd_log, reverse=True), 'bx')
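I am not sure whether I have misunderstood how to use the distributions, or whether I am missing something else...
I thought I had found the solution here: Does anyone have example code of using scipy.stats.distributions? but I am not able to get the shape from my data... Am I missing something in the use of the fit function?
Thanks
EDIT
Here is an example that illustrates my problem better: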
print('solution 1:')
means = []
stdes = []
distr = lognorm(mu, sigma)
for _ in range(1000):
    rnd_log = distr.rvs(size=100)
    means.append(np.mean(np.log(rnd_log)))
    stdes.append(np.std(np.log(rnd_log)))
print('observed mean:', mu, 'mean simulated mean:', np.mean(means))
print('observed std :', sigma, 'mean simulated std :', np.mean(stdes))
print('\nsolution 2:')
means = []
stdes = []
shape, loc, scale = lognorm.fit(mydata)
for _ in range(1000):
    rnd_log = lognorm.rvs(shape, loc=loc, scale=scale, size=100)
    means.append(np.mean(np.log(rnd_log)))
    stdes.append(np.std(np.log(rnd_log)))
print('observed mean:', mu, 'mean simulated mean:', np.mean(means))
print('observed std :', sigma, 'mean simulated std :', np.mean(stdes))
The result is:
solution 1:
observed mean: 1.82562655734 mean simulated mean: 1.18929982267
observed std : 1.39003773799 mean simulated std : 0.88985924363
solution 2:
observed mean: 1.82562655734 mean simulated mean: 4.50608084668
observed std : 1.39003773799 mean simulated std : 5.44206119499
Meanwhile, if I do the same thing in R:
mydata <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,5,6,6,6,6,6,7,7,7,8,8,8,8,8,9,9,9,10,10,11,12,13,14,14,15,19,19,21,23,25,27,28,30,31,36,41,45,48,52,55,60,68,75,86,118,159,207,354)
meanlog <- mean(log(mydata))
sdlog <- sd(log(mydata))
means <- c()
stdes <- c()
for (i in 1:1000){
  rnd.log <- rlnorm(length(mydata), meanlog, sdlog)
  means <- c(means, mean(log(rnd.log)))
  stdes <- c(stdes, sd(log(rnd.log)))
}
print (paste('observed mean:',meanlog,'mean simulated mean:',mean(means),sep=' '))
print (paste('observed std :',sdlog ,'mean simulated std :',mean(stdes),sep=' '))
I get:
[1] "observed mean: 1.82562655733507 mean simulated mean: 1.82307191072317"
[1] "observed std : 1.39704049131865 mean simulated std : 1.39736545866904"
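which is much closer, so I guess I am doing something wrong when using scipy...
Answer 0 (score: 4)
The lognormal distribution in scipy is parameterized a little differently from the usual way. See the scipy.stats.lognorm documentation, particularly the "Notes" section.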
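For reference, the parameterization described in those notes can be written out as follows. This is a summary of the documented form, where s is the shape parameter and loc and scale are scipy's generic shift and scaling parameters:

```latex
f(x; s) = \frac{1}{s\,x\,\sqrt{2\pi}} \exp\!\left(-\frac{\log^2 x}{2 s^2}\right),
\qquad x > 0,\ s > 0

f(x; s, \mathrm{loc}, \mathrm{scale})
  = \frac{1}{\mathrm{scale}}\, f\!\left(\frac{x - \mathrm{loc}}{\mathrm{scale}};\ s\right)

\log X \sim \mathcal{N}(\mu, \sigma^2)
  \;\Longleftrightarrow\;
  s = \sigma,\quad \mathrm{loc} = 0,\quad \mathrm{scale} = e^{\mu}
```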
Here is how to get the results you expect (note that we hold the location fixed at 0 when fitting):
In [315]: from scipy import stats
In [316]: x = np.array([1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,5,6,6,6,6,6,7,7,7,8,8,8,8,8,9,9,9,10,10,11,12,13,14,14,15,19,19,21,23,25,27,28,30,31,36,41,45,48,52,55,60,68,75,86,118,159,207,354])
In [317]: mu, sigma = stats.norm.fit(np.log(x))
In [318]: mu, sigma
Out[318]: (1.8256265573350701, 1.3900377379913127)
In [319]: shape, loc, scale = stats.lognorm.fit(x, floc=0)
In [320]: np.log(scale), shape
Out[320]: (1.8256267737298788, 1.3900309739954713)
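The same mapping can be applied by hand, which also shows what goes wrong with lognorm(mu, sigma): scipy reads the two positional arguments as the shape s and loc, not as the mean and standard deviation of the logarithm. A minimal sketch, where the helper name lognorm_from_log_moments is hypothetical and the numbers are the rounded values from the session above:

```python
import numpy as np
from scipy import stats


def lognorm_from_log_moments(mu, sigma):
    """Frozen scipy lognormal whose logarithm is Normal(mu, sigma)."""
    # scipy's shape s plays the role of sigma, and scale is exp(mu); loc stays 0.
    return stats.lognorm(s=sigma, loc=0, scale=np.exp(mu))


mu, sigma = 1.8256, 1.3900  # rounded values from the session above
dist = lognorm_from_log_moments(mu, sigma)

# The median of a lognormal is exp(mu), so its logarithm recovers mu.
print(np.log(dist.median()))

# Quantiles of log(X) follow mu + sigma * z, where z is a standard normal quantile.
z = stats.norm.ppf(0.975)
print(np.log(dist.ppf(0.975)), mu + sigma * z)

# The call from the question is read as s=mu, loc=sigma, scale=1,
# so its median is loc + scale = sigma + 1 rather than exp(mu).
wrong = stats.lognorm(mu, sigma)
print(wrong.median(), sigma + 1)
```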
Now you can generate samples and confirm your expectations:
In [321]: dist = stats.lognorm(shape, loc, scale)
In [322]: means, sds = [], []
In [323]: for i in range(1000):
   .....:     sample = dist.rvs(size=100)
   .....:     logsample = np.log(sample)
   .....:     means.append(logsample.mean())
   .....:     sds.append(logsample.std())
   .....:
In [324]: np.mean(means), np.mean(sds)
Out[324]: (1.8231068508345041, 1.3816361818739145)
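For anyone who wants to rerun the check outside IPython, the session above can be condensed into a standalone script. This is a sketch rather than the only way to do it: the seed is arbitrary, and drawing all 1000 samples at once as a 2-D array is my own shortcut, not part of the original session.

```python
import numpy as np
from scipy import stats

x = np.array([1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,5,6,6,6,6,6,7,7,7,8,8,8,8,8,9,9,9,10,10,11,12,13,14,14,15,19,19,21,23,25,27,28,30,31,36,41,45,48,52,55,60,68,75,86,118,159,207,354])

log_x = np.log(x)
mu = log_x.mean()
sigma = log_x.std()          # ddof=0: divides by n (numpy's default)
sigma_r = log_x.std(ddof=1)  # ddof=1: divides by n - 1, as R's sd() does

# Fit with the location fixed at 0 so that shape and scale map to sigma and exp(mu).
shape, loc, scale = stats.lognorm.fit(x, floc=0)
dist = stats.lognorm(shape, loc=loc, scale=scale)

# 1000 simulated samples of size 100, one per row (seed chosen arbitrarily).
rng = np.random.default_rng(0)
log_samples = np.log(dist.rvs(size=(1000, 100), random_state=rng))
sim_mean = log_samples.mean(axis=1).mean()
sim_sd = log_samples.std(axis=1).mean()

print('observed mean:', mu, 'mean simulated mean:', sim_mean)
print('observed std :', sigma, 'mean simulated std :', sim_sd)
print('std with ddof=1:', sigma_r)
```

The ddof=1 value accounts for the small difference between the observed standard deviations in the question: R's sd() divides by n - 1 and gives about 1.3970, while np.std divides by n by default and gives about 1.3900.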