Question

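I am still getting started with programming in Python. This is my first attempt at working with histograms and fits!

Specifically, I have a dataset and I have built a histogram of it. At this point I should perform a Rayleigh fit, but I cannot work out the right way to set the parameters. I have read about loc and scale, and that the fit parameters should usually be set to 0 and 1. Clearly, the fit does not work well this way! Can anyone help me? For clarity, I am attaching the code I am using.

Thanks.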

pipeline.set_params(hyperparameter=1, ...)

My data (fondi) is: [13 15 13 14 12 13 12 14 15 12 11 10 11 15 18 11 11 11 13 15 15 15 15 12 12 13 12 15 15 15 12 12 11 14 16 11 13 14 16 17 24 21 16 20 18 18 19 21 22 19 15 16 15 13 14 16 18 21 19 22 14 13 14 15 14 17 19 17 16 18 12 15 17 17 16 17 16 19 17 14 13 16 16 13 15 17 17 20 18 17 12 19 14 15 15 14 13 17 16 14 12 11 12 20 19 16 24 19 20 19 17 16 17 16 19 22 17 16 20 22 21 22 20 14 18 16 19 20 17 20 22 20 22 19 17 13 16 18 14 16 20 20 18 19 19 16 19 12 12 14 14 13 15 16 16 19 16 17 12 11 11 10 12 11 11 13 14 13 17 8 8 8 10 10 10 14 16 11 9 9 11 10 17 13 15 19 15 13 16 17 14 12 13 14 11 10 15 13 12 12 11 10 9 9 9 9 8 15 16 12 9 11 9 10 10 7 7 7 21 19 13 10 15 12 10 10 9 8 10 20 14 13 11 13 15 14 10 11 12 16 17 15 12 13 16 15 13 14 17 14 13 15 13 11 14 15 17 18 22 22 16 16 17 22 17 17 18 26 17 19 21 16 15 19 19 22 19 18 17 18 18 12 17 17 17 18 18 14 16 20 17 16 16 18 16 19 18 18 20 20]

Output: loc = 6.783540954380711 scale = 6.430045149216335

Answer 1

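Adjusted MCVE

Below is a simple procedure that draws a trial dataset from a Rayleigh distribution and then recovers its parameters by Maximum Likelihood Estimation, using the fit method provided by scipy.stats.rv_continuous: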

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Create a Continuous Variable: 
X = stats.rayleigh(loc=10, scale=5)

# Sample from this Random variable
x0 = X.rvs(size=10000, random_state=123)

# Adjust Distribution parameters
loc, scale = stats.rayleigh.fit(x0) # (9.990726961181025, 4.9743913760956335)

# Tabulate over sample range (PDF display):
xl = np.linspace(x0.min(), x0.max(), 100)

# Display Results:
fig, axe = plt.subplots()
axe.hist(x0, density=True, label="Sample")
axe.plot(xl, X.pdf(xl), label="Exact Distribution")
axe.plot(xl, stats.rayleigh(scale=scale, loc=loc).pdf(xl), label="Adjusted Distribution")
axe.set_title("Distribution Fit")
# Raw strings keep the LaTeX backslashes from being read as escape sequences
axe.set_xlabel(r"Variable, $x$ $[\mathrm{AU}]$")
axe.set_ylabel(r"Density, $f(x)$ $[\mathrm{AU}^{-1}]$")
axe.legend()
axe.grid()

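It renders as follows:

Notes

I would like to draw your attention to the following points:

300 is a huge number of histogram bins: you will end up with empty or sparsely populated bins, which degrades the quality of the representation. Because of these under-represented bins, statistical tests (e.g. the chi-square goodness-of-fit test) may also fail. You can of course let matplotlib estimate the number of bins;
Distributions generally take location and scale parameters, and scipy.stats does everything it can to standardize every available distribution this way. To find the correspondence with the usual parametric distribution definition, you need to solve the following: pdf(x) = pdf(y)/scale where y = (x-loc)/scale. In this case, you will see that the scale parameter is equivalent to sigma, and that this holds regardless of a shift of the origin (it does not depend on the loc value);
To fit a distribution, at some point you need to run an analytical or statistical procedure that estimates the parameters from the sampled data. This part is missing from your code (see stats.rayleigh.fit(x0) in the MCVE above). It is independent of anything matplotlib draws and is handled by scipy, which performs the MLE on the full dataset (which is why changing the bins only affects the histogram display and nothing else).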
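The last two points can be checked numerically. What follows is only a minimal sketch: the values of sigma and the shift are assumptions chosen for illustration, the sample is synthetic, and rayleigh_pdf is a hypothetical helper that writes out the textbook density (it is not a scipy function).

```python
import numpy as np
from scipy import stats

sigma, shift = 5.0, 10.0  # illustrative values, not taken from the question


def rayleigh_pdf(r, sigma):
    """Textbook Rayleigh density for r >= 0 (hypothetical helper)."""
    return r / sigma**2 * np.exp(-r**2 / (2 * sigma**2))


x = np.linspace(shift + 0.1, shift + 25.0, 50)
y = (x - shift) / sigma

# scipy convention: pdf(x; loc, scale) = pdf(y) / scale with y = (x - loc) / scale
lhs = stats.rayleigh.pdf(x, loc=shift, scale=sigma)
rhs = stats.rayleigh.pdf(y) / sigma
same_as_standardised = np.allclose(lhs, rhs)

# scale plays the role of sigma, whatever the value of loc
same_as_textbook = np.allclose(lhs, rayleigh_pdf(x - shift, sigma))

# The fit uses the raw sample, so the bin count only matters for the plot
sample = stats.rayleigh.rvs(loc=shift, scale=sigma, size=500, random_state=0)
edges = np.histogram_bin_edges(sample, bins="auto")  # let numpy pick the bins
loc_hat, scale_hat = stats.rayleigh.fit(sample)
print(same_as_standardised, same_as_textbook, len(edges) - 1, loc_hat, scale_hat)
```

Passing bins="auto" to axe.hist should give the same bin edges, since matplotlib delegates the binning to numpy.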

Update

Following the update to your post, I have completed my answer. Using the dataset you provided:

x0 = np.array([13, 15, 13, 14, 12, 13, 12, 14, 15, 12, 11, 10, 11, 15, 18, 11, 11, 11, 13,
               15, 15, 15, 11, 12, 13, 12, 15, 15, 15, 12, 12, 11, 14, 16, 11, 13, 14, 16,
               17, 24, 21, 16, 20, 18, 18, 19, 21, 22, 19, 15, 16, 15, 13, 14, 16, 18, 21,
               19, 22, 14, 13, 14, 15, 14, 17, 19, 17, 16, 18, 12, 15, 17, 17, 16, 17, 16,
               19, 17, 14, 13, 16, 16, 13, 15, 17, 17, 20, 18, 17, 12, 19, 14, 15, 15, 14,
               13, 17, 16, 14, 12, 11, 12, 20, 19, 16, 24, 19, 20, 19, 17, 16, 17, 16, 19,
               22, 17, 16, 20, 22, 21, 22, 20, 14, 18, 16, 19, 20, 17, 20, 22, 20, 22, 19,
               17, 13, 16, 18, 14, 16, 20, 20, 18, 19, 19, 16, 19, 12, 12, 14, 14, 13, 15,
               16, 16, 19, 16, 17, 12, 11, 11, 10, 12, 11, 11, 13, 14, 13, 17, 8, 8, 8, 10,
               10, 10, 14, 16, 11, 9, 9, 11, 10, 17, 13, 15, 19, 15, 13, 16, 17, 14, 12, 13,
               14, 11, 10, 15, 13, 12, 12, 11, 10, 9, 9, 9, 9, 8, 15, 16, 12, 9, 11, 9, 10,
               10, 7, 7, 7, 21, 19, 13, 10, 15, 12, 10, 10, 9, 8, 10, 20, 14, 13, 11, 13, 15,
               14, 10, 11, 12, 16, 17, 15, 12, 13, 16, 15, 13, 14, 17, 14, 13, 15, 13, 11, 14,
               15, 17, 18, 22, 21, 16, 17, 22, 17, 17, 18, 26, 17, 19, 21, 16, 15, 19, 19, 22,
               19, 18, 17, 18, 18, 12, 17, 17, 17, 18, 14, 16, 20, 17, 16, 16, 18, 16, 19, 18,
               18, 20, 18])

We can try to fit a Rayleigh distribution:

p = stats.rayleigh.fit(x0)
X = stats.rayleigh(*p)
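The question mentions setting loc to 0. If the underlying model really is a Rayleigh distribution starting at the origin, one option is to hold the location fixed during the fit with the floc keyword of fit. The sketch below uses a synthetic sample (an assumption, not the dataset above). With loc fixed at 0, the maximum likelihood estimate of scale has the closed form sqrt(sum(x^2) / (2 n)), which the code uses as a cross-check:

```python
import numpy as np
from scipy import stats

# Synthetic sample, assumed for illustration only
sample = stats.rayleigh.rvs(loc=0, scale=5, size=2000, random_state=1)

# Free fit: both loc and scale are estimated
loc_free, scale_free = stats.rayleigh.fit(sample)

# Constrained fit: loc is held at 0, only scale is estimated
loc_fixed, scale_fixed = stats.rayleigh.fit(sample, floc=0)

# Closed-form MLE of scale when loc = 0
scale_closed = np.sqrt(np.sum(sample**2) / (2 * sample.size))
print(loc_free, scale_free, loc_fixed, scale_fixed, scale_closed)
```

Whether fixing loc is appropriate depends on where the data come from; it is a modelling decision, not something the fit can decide for you.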

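Visually, the fit is not very good:

Let us confirm this with statistical tests. First, we can use the Kolmogorov-Smirnov test to check whether the ECDF of the sample is compatible with the CDF of the fitted distribution: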

kst = stats.kstest(x0, X.cdf)
# KstestResult(statistic=0.12701044409231593, pvalue=0.0001232197856051324)
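To make the statistic less of a black box, here is a minimal sketch that recomputes it by hand as the largest vertical gap between the ECDF steps and the fitted CDF, then compares it with stats.kstest. The sample is synthetic and assumed for illustration only.

```python
import numpy as np
from scipy import stats

# Synthetic sample, assumed for illustration only
sample = stats.rayleigh.rvs(loc=10, scale=5, size=300, random_state=2)
fitted = stats.rayleigh(*stats.rayleigh.fit(sample))

# ECDF steps versus the fitted CDF at the sorted sample points
xs = np.sort(sample)
n = xs.size
cdf = fitted.cdf(xs)
d_plus = np.max(np.arange(1, n + 1) / n - cdf)  # ECDF above the CDF
d_minus = np.max(cdf - np.arange(0, n) / n)     # ECDF below the CDF
d_manual = max(d_plus, d_minus)

kst = stats.kstest(sample, fitted.cdf)
print(d_manual, kst.statistic, kst.pvalue)
```

Keep in mind that the Kolmogorov-Smirnov test assumes a fully specified reference distribution. When the parameters are estimated from the same sample, as here, the reported p-value tends to be larger than it should be.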

We can also evaluate the expected counts under the fitted distribution and compare them with the observed counts using the Chi Square Test:

c, b = np.histogram(x0)
ct = np.diff(X.cdf(b))*np.sum(c)
c2t = stats.chisquare(c, ct, ddof=2)
# Power_divergenceResult(statistic=31.874916914227434, pvalue=4.284273564311872e-05)

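The delta degrees of freedom equals 2 because, in addition to computing the chi-square statistic, we had to estimate the loc and scale parameters of the Rayleigh distribution from the data (hence ddof=2 in the test call).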
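One practical caveat: as far as I know, recent SciPy releases make stats.chisquare raise a ValueError when the observed and expected totals differ by more than a small relative tolerance. The expected counts built from the fitted CDF fall slightly short of the observed total, because the fitted distribution puts a little probability mass outside the range of the sample. A possible workaround, sketched here on synthetic data (an assumption, not the dataset above), is to rescale the expected counts before the test:

```python
import numpy as np
from scipy import stats

# Synthetic sample, assumed for illustration only
sample = stats.rayleigh.rvs(loc=10, scale=5, size=10000, random_state=123)
fitted = stats.rayleigh(*stats.rayleigh.fit(sample))

observed, edges = np.histogram(sample)  # 10 equal-width bins by default
expected = np.diff(fitted.cdf(edges)) * observed.sum()

# Rescale so both totals agree (a choice made for this sketch)
expected *= observed.sum() / expected.sum()

# ddof=2: two parameters (loc, scale) were estimated from the data
c2t = stats.chisquare(observed, expected, ddof=2)
print(c2t.statistic, c2t.pvalue)
```

Rescaling shifts the statistic slightly compared with the unscaled version, so the numbers will not match the output quoted above exactly.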

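Both tests return very low and similar p-values, which means the null hypotheses are very unlikely to hold (so the tests tell us to reject them):

Kolmogorov: H0 = the sample is drawn from the reference distribution;
Chi-square: H0 = there is no difference between the observed and expected counts across categories;

It is therefore hard to believe that your dataset comes from the fitted Rayleigh distribution.

You can compare these results with the synthetic data drawn in the MCVE, for which the tests return p-values above 10%: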

# KstestResult(statistic=0.0097140857969642, pvalue=0.3019167138216704)
# Power_divergenceResult(statistic=11.170065854104491, pvalue=0.13137094282775724)

In this case we cannot reject H0, and we accept that the sampled data may come from the fitted Rayleigh distribution.
