Fitting a normal distribution to a weighted list

Time: 2017-11-22 17:17:43

Tags: python scipy

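I have a set of data points that I would like to fit a normal distribution to. I see that scipy has the stats.norm.fit method, but it takes a flat list of data points, something like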
data = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 5, 5, 5]

whereas my data is held in two lists, e.g.

values = [1, 2, 3, 4, 5]
counts = [4, 3, 6, 1, 3]

How can I fit a normal distribution to data in this format?

2 answers:

Answer 0: (score: 3)

You can use numpy.repeat to expand the values into the full data set, then pass the result to scipy.stats.norm.fit:

In [54]: import numpy as np

In [55]: from scipy.stats import norm

In [56]: values = [1, 2, 3, 4, 5]

In [57]: counts = [4, 3, 6, 1, 3]

In [58]: full_values = np.repeat(values, counts)

In [59]: full_values
Out[59]: array([1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 5, 5, 5])

In [60]: norm.fit(full_values)  # Estimate mean and std. dev.
Out[60]: (2.7647058823529411, 1.3516617991854185)

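scipy.stats.norm.fit computes the maximum likelihood estimates of the parameters. For the normal distribution, these are just the sample mean and the square root of the (biased) sample variance. As far as I know, the only relevant weighted statistics function in numpy or scipy is numpy.average. You can do the computation yourself with numpy.average, passing counts as the weights argument.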

In [62]: sample_mean = np.average(values, weights=counts)

In [63]: sample_mean
Out[63]: 2.7647058823529411

In [64]: sample_var = np.average((values - sample_mean)**2, weights=counts)

In [65]: sample_var
Out[65]: 1.8269896193771626

In [66]: sample_std = np.sqrt(sample_var)

In [67]: sample_std
Out[67]: 1.3516617991854185
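As a rough sketch, the weighted computation above can be wrapped in a small helper and checked against norm.fit on the expanded data. The name weighted_norm_fit is made up for this illustration and is not part of numpy or scipy; the helper assumes the weights are non-negative counts.

```python
import numpy as np
from scipy.stats import norm


def weighted_norm_fit(values, counts):
    """Hypothetical helper: normal MLE (mean, std) from values and their counts."""
    values = np.asarray(values, dtype=float)
    counts = np.asarray(counts, dtype=float)
    mean = np.average(values, weights=counts)
    # Biased (divide-by-n) variance, which is the maximum likelihood estimate.
    var = np.average((values - mean) ** 2, weights=counts)
    return mean, np.sqrt(var)


values = [1, 2, 3, 4, 5]
counts = [4, 3, 6, 1, 3]

mu, sigma = weighted_norm_fit(values, counts)

# Reference result from fitting the fully expanded sample.
mu_ref, sigma_ref = norm.fit(np.repeat(values, counts))

print(np.isclose(mu, mu_ref), np.isclose(sigma, sigma_ref))
```

The weighted form avoids materializing the expanded array, which may matter when the counts are large.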

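Note that statistics.stdev is based on the unbiased sample variance. If that is what you want, you can adjust the scaling by multiplying the biased sample variance by sum(counts)/(sum(counts) - 1):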

In [79]: n = sum(counts)

In [80]: sample_var = n/(n-1)*np.average((values - sample_mean)**2, weights=counts)

In [81]: sample_var
Out[81]: 1.9411764705882353

In [82]: sample_std = np.sqrt(sample_var)

In [83]: sample_std
Out[83]: 1.3932610920384718
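One way to keep both conventions in a single place is a ddof-style switch, mirroring the ddof parameter of numpy.std. This is only a sketch: weighted_std is a hypothetical name, and the n - ddof correction is only meaningful when the weights are integer counts (so that sum(counts) really is the sample size).

```python
import numpy as np


def weighted_std(values, counts, ddof=0):
    """Hypothetical helper: std of values repeated counts times.

    ddof=0 gives the biased (maximum likelihood) estimate,
    ddof=1 gives the square root of the unbiased sample variance.
    """
    values = np.asarray(values, dtype=float)
    counts = np.asarray(counts)
    n = counts.sum()  # total number of observations
    mean = np.average(values, weights=counts)
    sum_sq = np.sum(counts * (values - mean) ** 2)
    return np.sqrt(sum_sq / (n - ddof))


values = [1, 2, 3, 4, 5]
counts = [4, 3, 6, 1, 3]

expanded = np.repeat(values, counts)
# Both conventions should agree with numpy.std on the expanded sample.
print(np.isclose(weighted_std(values, counts, ddof=0), np.std(expanded, ddof=0)))
print(np.isclose(weighted_std(values, counts, ddof=1), np.std(expanded, ddof=1)))
```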

Answer 1: (score: 1)

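The most straightforward way to characterize a sample from a (purportedly) normal population is to take its mean and standard deviation. We can do that with the standard library, using Martelli's lambda to flatten the sample: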

>>> values = [1, 2, 3, 4, 5]
>>> counts = [4, 3, 6, 1, 3]
>>> import statistics
>>> sample = [c*[v] for (c, v) in zip(counts, values)]
>>> sample
[[1, 1, 1, 1], [2, 2, 2], [3, 3, 3, 3, 3, 3], [4], [5, 5, 5]]
>>> flatten = lambda l: [item for sublist in l for item in sublist]
>>> sample = flatten(sample)
>>> sample
[1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 5, 5, 5]
>>> statistics.mean(sample)
2.764705882352941
>>> statistics.stdev(sample)
1.3932610920384718
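For comparison, here is a possible variation on the same idea that flattens with itertools instead of a lambda and prints both standard-library estimators side by side. statistics.stdev divides by n - 1, while statistics.pstdev divides by n, so pstdev is the one that should correspond to the maximum likelihood value returned by scipy.stats.norm.fit. The variable names are chosen for this sketch only.

```python
import statistics
from itertools import chain, repeat

values = [1, 2, 3, 4, 5]
counts = [4, 3, 6, 1, 3]

# Repeat each value by its count, then chain the pieces into one flat list.
sample = list(chain.from_iterable(repeat(v, c) for v, c in zip(values, counts)))

sample_sd = statistics.stdev(sample)       # divides by n - 1
population_sd = statistics.pstdev(sample)  # divides by n

print(statistics.mean(sample), sample_sd, population_sd)
```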