I was trying to write a little program in Python 3 that simulates sampling from a population of random numbers. But it seems to show the opposite of what I intended. What am I doing wrong? It must be extremely easy, but I don't get it.
import random
import statistics
import math

pcounter = 0
counter = 0
for loop in range(1000):
    l = []
    for x in range(500):
        l.append(random.randint(1,1000))
    m = statistics.mean(l)
    v = list(l)
    v[:] = [(x-m)**2 for x in v]
    realvariance = sum(v)/len(v)
    #print("Real Variance: " + str( sum(v)/len(v)))
    #print("Real Mean: " + str(m))
    sample = random.sample(l, 10)
    v = list(sample)
    #print(v)
    v[:] = [(x-m)**2 for x in v]
    samplem = statistics.mean(sample)
    samplebiasedvariance = sum(v)/len(v)
    samplevariance = sum(v)/(len(v)-1)
    print(samplebiasedvariance)
    print(samplevariance)
    print(realvariance)
    print((samplebiasedvariance - realvariance)**2 < (samplevariance - realvariance)**2)
    if (samplebiasedvariance - realvariance)**2 < (samplevariance - realvariance)**2:
        pcounter = pcounter + 1
        print("biased Variance wins: " + str(pcounter))
    else:
        counter = counter + 1
        print("Variance wins: " + str(counter))
print("biased Variance wins: " + str(pcounter))
print("Variance wins: " + str(counter))
This results in:
biased Variance wins: 563
Variance wins: 437
But it should be the other way around: I would expect the biased variance to be worse than the unbiased variance, which is calculated using (n-1). The unbiased variance should therefore be closer to the true population variance (realvariance) more often than the biased one.
Answer 0 (score: 1)
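What you are looking at is essentially the mean squared error of the two variance estimators. (Whichever one is closer to the actual value more often will tend to have the smaller mean squared error.) It turns out that, at least for normally distributed data, the unbiased sample variance has a larger mean squared error than the usual biased sample variance, which in turn has a larger mean squared error than the sample variance computed with 1/(n + 1) instead of 1/n or 1/(n - 1).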
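If I understand correctly, if you put the 1/(n + 1) estimator into your program, you should find that it comes closer to the actual value than the other two.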
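As a rough illustration of that comparison, here is a minimal sketch that estimates the mean squared error of all three divisors by simulation. It is not the asker's program: the function name `mse_by_divisor` and its parameters are made up for this example, it draws normally distributed samples with a known variance instead of sampling from a finite population of uniform integers, and it takes the squared deviations around the sample mean (the program in the question takes them around the population mean `m`). The ordering described above is a result for normal data, so it need not carry over unchanged to other distributions.

```python
import random


def mse_by_divisor(n=10, trials=20000, sigma=1.0, seed=0):
    """Estimate the MSE of three variance estimators on normal samples.

    Hypothetical helper for illustration only. Each estimator divides the
    same sum of squared deviations (around the sample mean) by n - 1, n,
    or n + 1.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    true_variance = sigma ** 2
    divisors = {"n-1": n - 1, "n": n, "n+1": n + 1}
    squared_error = {name: 0.0 for name in divisors}

    for _ in range(trials):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        sample_mean = sum(sample) / n
        # Sum of squared deviations around the sample mean.
        ss = sum((x - sample_mean) ** 2 for x in sample)
        for name, d in divisors.items():
            squared_error[name] += (ss / d - true_variance) ** 2

    # Average squared error over all trials is the estimated MSE.
    return {name: total / trials for name, total in squared_error.items()}


if __name__ == "__main__":
    for name, mse in mse_by_divisor().items():
        print(f"divisor {name}: estimated MSE = {mse:.4f}")
```

All three estimators are evaluated on the same samples, so the comparison is paired and the differences between them are less noisy than the individual MSE estimates.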
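This topic is discussed on the Wikipedia page for variance, under the heading "Population variance and sample variance". No doubt there are many other resources as well.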