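I recently started using Python's statistics module.
I noticed that, by default, the variance() method returns the 'unbiased' variance, i.e. the sample variance: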
import statistics as st
from random import randint

def myVariance(data):
    # finds the variance of a given set of numbers (divides by N)
    xbar = st.mean(data)
    return sum((x - xbar)**2 for x in data) / len(data)

def myUnbiasedVariance(data):
    # finds the 'unbiased' variance of a given set of numbers (divides by N-1)
    xbar = st.mean(data)
    return sum((x - xbar)**2 for x in data) / (len(data) - 1)

population = [randint(0, 1000) for i in range(100)]
print(myVariance(population))
print(myUnbiasedVariance(population))
print(st.variance(population))
Output:
81295.8011
82116.9708081
82116.9708081
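This seems strange to me. I suppose people are often working with samples, so they want the sample variance, but I would expect the default function to compute the population variance. Does anyone know why this is?
Answer 0 (score: 1)
I would argue that almost every time people estimate the variance from data, they are working with a sample. And, by the definition of an unbiased estimate, the expected value of the unbiased estimate of the variance equals the population variance.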
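In your code you use random.randint(0, 1000), which samples from a discrete uniform distribution with 1001 possible values and variance 1000 * 1002 / 12 = 83500 (see, e.g., MathWorld). Here is code showing that, on average, when samples are used as input, statistics.variance() gets closer to the population variance than statistics.pvariance():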
import random
import statistics as st
import numpy as np

var, pvar = [], []
for i in range(10000):
    # draw a sample of 10 values and record both variance estimates
    smpl = [random.randint(0, 1000) for j in range(10)]
    var.append(st.variance(smpl))
    pvar.append(st.pvariance(smpl))

print("mean variance(sample): %.1f" % np.mean(var))
print("mean pvariance(sample): %.1f" % np.mean(pvar))
print("pvariance(population): %.1f" % st.pvariance(range(1001)))
Sample output:
mean variance(sample): 83626.0
mean pvariance(sample): 75263.4
pvariance(population): 83500.0
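The gap between the two averages is what the theory predicts: for a sample of size n, pvariance() is variance() multiplied by (n - 1) / n, so its expected value is (n - 1) / n times the population variance, here 0.9 * 83500 = 75150. The following is a minimal sketch of that check using only the standard library; the seed, the number of trials, and the variable names are arbitrary choices of mine, not part of the original experiment.

```python
import random
import statistics as st

random.seed(0)   # fixed seed so the run is repeatable; any seed would do

n = 10           # sample size, same as in the simulation above
trials = 5000    # number of repeated samples (arbitrary choice)
sigma2 = st.pvariance(range(1001))  # true variance of randint(0, 1000)

var_sum = 0.0
pvar_sum = 0.0
for _ in range(trials):
    sample = [random.randint(0, 1000) for _ in range(n)]
    v = st.variance(sample)    # divides by n - 1
    p = st.pvariance(sample)   # divides by n
    # For any single sample the two differ only by the factor (n - 1) / n.
    assert abs(p - v * (n - 1) / n) < 1e-6
    var_sum += v
    pvar_sum += p

mean_var = var_sum / trials
mean_pvar = pvar_sum / trials
print("true variance:            %.1f" % sigma2)
print("mean variance(sample):    %.1f" % mean_var)
print("mean pvariance(sample):   %.1f" % mean_pvar)
print("(n - 1) / n * true value: %.1f" % (sigma2 * (n - 1) / n))
```

With more trials the first two printed averages should approach the first and last lines respectively; the exact figures depend on the seed.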
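Answer 1 (score: -2)
Here is another great post on this. I was wondering exactly the same thing, and the answer there really cleared it up for me. With np.var you can pass the argument "ddof=1" to get the unbiased estimator. Check it out: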
What is the difference between numpy var() and statistics variance() in python?
import numpy as np
print(np.var([1, 2, 3, 4], ddof=1))
1.66666666667
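To make the correspondence explicit: np.var defaults to ddof=0 (divide by N), which matches statistics.pvariance(), while ddof=1 (divide by N - 1) matches statistics.variance(). The sketch below illustrates this without NumPy; var_ddof is a hypothetical helper written only for this illustration, not a function from either library.

```python
import statistics as st

def var_ddof(data, ddof=0):
    """Hypothetical helper: sum of squared deviations divided by (N - ddof)."""
    data = list(data)
    mean = sum(data) / len(data)
    return sum((x - mean) ** 2 for x in data) / (len(data) - ddof)

data = [1, 2, 3, 4]

# ddof=0 divides by N, like statistics.pvariance (and np.var's default)
print(var_ddof(data, ddof=0), st.pvariance(data))

# ddof=1 divides by N - 1, like statistics.variance (and np.var(..., ddof=1))
print(var_ddof(data, ddof=1), st.variance(data))
```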