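I have a bunch of data points, and I would like to fit a normal distribution to the data. I see that scipy has the stats.norm.fit method, but it requires a single list of data points, like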
data = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 5, 5, 5]
whereas my data is contained in two lists, e.g.
values = [1, 2, 3, 4, 5]
counts = [4, 3, 6, 1, 3]
How can I fit a normal distribution to data formatted this way?
Answer 0 (score: 3)
You can use numpy.repeat to expand the values into the full data set, and then pass the result to scipy.stats.norm.fit:
In [54]: import numpy as np
In [55]: from scipy.stats import norm
In [56]: values = [1, 2, 3, 4, 5]
In [57]: counts = [4, 3, 6, 1, 3]
In [58]: full_values = np.repeat(values, counts)
In [59]: full_values
Out[59]: array([1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 5, 5, 5])
In [60]: norm.fit(full_values) # Estimate mean and std. dev.
Out[60]: (2.7647058823529411, 1.3516617991854185)
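scipy.stats.norm.fit computes the maximum likelihood estimates of the parameters. For the normal distribution, these are just the sample mean and the square root of the (biased) sample variance.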
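As a quick check of that statement, here is a minimal sketch that compares the output of norm.fit with np.mean and np.std on the example data from the question. It relies on ddof=0 (the numpy default) dividing by n, which is the biased estimate; the variable names are only illustrative.

```python
import numpy as np
from scipy.stats import norm

values = [1, 2, 3, 4, 5]
counts = [4, 3, 6, 1, 3]

# Expand the (value, count) pairs into the full sample.
full_values = np.repeat(values, counts)

# Maximum likelihood estimates reported by scipy.
mu, sigma = norm.fit(full_values)

# Sample mean and biased standard deviation (ddof=0 divides by n).
mean = full_values.mean()
std_biased = full_values.std(ddof=0)

# Both comparisons should hold up to floating-point rounding.
print(np.isclose(mu, mean), np.isclose(sigma, std_biased))
```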
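As far as I know, the only relevant weighted statistical function in numpy or scipy is numpy.average. You can do the calculation yourself by calling numpy.average with counts as the weights argument.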
In [62]: sample_mean = np.average(values, weights=counts)
In [63]: sample_mean
Out[63]: 2.7647058823529411
In [64]: sample_var = np.average((values - sample_mean)**2, weights=counts)
In [65]: sample_var
Out[65]: 1.8269896193771626
In [66]: sample_std = np.sqrt(sample_var)
In [67]: sample_std
Out[67]: 1.3516617991854185
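Note that statistics.stdev returns the square root of the unbiased sample variance. If that is what you want, you can adjust the scaling by multiplying the biased sample variance by sum(counts)/(sum(counts) - 1):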
In [79]: n = sum(counts)
In [80]: sample_var = n/(n-1)*np.average((values - sample_mean)**2, weights=counts)
In [81]: sample_var
Out[81]: 1.9411764705882353
In [82]: sample_std = np.sqrt(sample_var)
In [83]: sample_std
Out[83]: 1.3932610920384718
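The two calculations above can be folded into one helper. This is only a sketch: the function name fit_normal_from_counts and its ddof parameter are made up for illustration, and the n/(n - ddof) rescaling assumes that counts are integer frequencies rather than arbitrary weights.

```python
import numpy as np

def fit_normal_from_counts(values, counts, ddof=0):
    """Return (mean, std) for values where values[i] occurs counts[i] times.

    ddof=0 gives the maximum likelihood (biased) estimate;
    ddof=1 gives the square root of the unbiased sample variance.
    """
    values = np.asarray(values, dtype=float)
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    mean = np.average(values, weights=counts)
    # The weighted average of squared deviations is the biased variance (divisor n).
    var = np.average((values - mean) ** 2, weights=counts)
    # Rescale so that the effective divisor becomes n - ddof.
    var *= n / (n - ddof)
    return mean, np.sqrt(var)

values = [1, 2, 3, 4, 5]
counts = [4, 3, 6, 1, 3]
mle = fit_normal_from_counts(values, counts)
unbiased = fit_normal_from_counts(values, counts, ddof=1)
print(mle)
print(unbiased)
```

Unlike the np.repeat approach, this never builds the expanded array, which may matter when the counts are large.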
Answer 1 (score: 1)
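The most direct way to characterize a sample from a (purportedly) normal population is to take its mean and standard deviation. We can do that with the standard library, along with Martelli's lambda to flatten the sample.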
>>> values = [1, 2, 3, 4, 5]
>>> counts = [4, 3, 6, 1, 3]
>>> import statistics
>>> sample = [c*[v] for (c, v) in zip(counts, values)]
>>> sample
[[1, 1, 1, 1], [2, 2, 2], [3, 3, 3, 3, 3, 3], [4], [5, 5, 5]]
>>> flatten = lambda l: [item for sublist in l for item in sublist]
>>> sample = flatten(sample)
>>> sample
[1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 5, 5, 5]
>>> statistics.mean(sample)
2.764705882352941
>>> statistics.stdev(sample)
1.3932610920384718
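For comparison, here is a sketch of the same idea that flattens the sample with itertools instead of the lambda, and also calls statistics.pstdev, which divides by n and so uses the same estimator as the biased value from scipy.stats.norm.fit in the other answer. The variable names are illustrative.

```python
import statistics
from itertools import chain, repeat

values = [1, 2, 3, 4, 5]
counts = [4, 3, 6, 1, 3]

# Repeat each value count times, then chain the pieces into one flat list.
sample = list(chain.from_iterable(repeat(v, c) for v, c in zip(values, counts)))

mean = statistics.mean(sample)
std_unbiased = statistics.stdev(sample)  # square root of the variance with divisor n - 1
std_biased = statistics.pstdev(sample)   # square root of the variance with divisor n
print(mean, std_unbiased, std_biased)
```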