Question

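I am trying to overlay a normal distribution curve on data generated according to the central limit theorem.

Below is the implementation I have tried.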

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import math

# 1000 simulations of die roll
n = 10000

avg = []
for i in range(1,n):#roll dice 10 times for n times
    a = np.random.randint(1,7,10)#roll dice 10 times from 1 to 6 & capturing each event
    avg.append(np.average(a))#find average of those 10 times each time

plt.hist(avg[0:])

zscore = stats.zscore(avg[0:])

mu, sigma = np.mean(avg), np.std(avg)
s = np.random.normal(mu, sigma, 10000)

# Create the bins and histogram
count, bins, ignored = plt.hist(s, 20, normed=True)

# Plot the distribution curve
plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) *np.exp( - (bins - mu)**2 / (2 * sigma**2)))

I get the following plot,

You can see the normal curve in red at the bottom.

Can anyone tell me why the curve does not fit?

Answer 1

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

# 10000 simulations, each averaging 10 die rolls
n = 10000

avg = []
for i in range(n):
    a = np.random.randint(1, 7, 10)  # roll a die 10 times (values 1 to 6)
    avg.append(np.average(a))  # store the average of those 10 rolls

# CHANGED: normalise this histogram and use 20 bins
# (density=True replaces normed=True, which newer Matplotlib no longer accepts)
plt.hist(avg, 20, density=True)

zscore = stats.zscore(avg)

mu, sigma = np.mean(avg), np.std(avg)
s = np.random.normal(mu, sigma, 10000)

# Create the bins and histogram
count, bins, ignored = plt.hist(s, 20, density=True)

# Plot the distribution curve
plt.plot(bins, 1 / (sigma * np.sqrt(2 * np.pi)) * np.exp(-(bins - mu) ** 2 / (2 * sigma ** 2)))

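I just normalized the histogram of the avg list so that it is on the same scale as the density curve.

Plot: [image not included]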

Answer 2

You almost have it! First, note that you are plotting two histograms on the same axes:

plt.hist(avg[0:])

and

plt.hist(s, 20, normed=True)

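So that the normal density can be drawn over a histogram, you correctly normalized the second plot with the normed=True argument (density=True in current Matplotlib, where normed has been removed). However, you forgot to normalize the first histogram as well (plt.hist(avg[0:], normed=True)).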
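To make the effect of normalization concrete, here is a minimal sketch that uses np.histogram instead of plt.hist so it runs without a display. The seed and the variable names (rng, counts, density, widths) are assumptions made for illustration and are not part of the original code.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
# 10000 averages of 10 die rolls each (integers 1 to 6)
avg = rng.integers(1, 7, size=(10_000, 10)).mean(axis=1)

# Raw counts: the bar heights sum to the number of samples
counts, edges = np.histogram(avg, bins=20)
# Density: the bar heights are rescaled so that the total bar area is 1
density, _ = np.histogram(avg, bins=20, density=True)

widths = np.diff(edges)
print(counts.sum())              # number of samples
print((density * widths).sum())  # total area under the density histogram
```

A pdf curve has area 1, so it only lines up with the histogram whose bars also have total area 1. Against raw counts in the thousands it looks flat near the bottom of the axes.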

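I would also suggest that, since you have already imported scipy.stats, you might as well use the normal distribution from that module rather than coding the pdf yourself.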
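As a quick check that the two approaches agree, the sketch below compares the hand-written formula from the question with stats.norm.pdf. The values of mu and sigma are illustrative assumptions, not results from a real run.

```python
import numpy as np
import scipy.stats as stats

mu, sigma = 3.5, 0.54  # illustrative values only
x = np.linspace(1.5, 5.5, num=100)

# Hand-written normal pdf, as in the question
manual_pdf = 1 / (sigma * np.sqrt(2 * np.pi)) * np.exp(-(x - mu) ** 2 / (2 * sigma ** 2))
# Same density from scipy: loc is the mean, scale is the standard deviation
scipy_pdf = stats.norm.pdf(x, loc=mu, scale=sigma)

print(np.allclose(manual_pdf, scipy_pdf))
```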

Putting it all together:

import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

# 10000 simulations, each averaging 10 die rolls
n = 10000

avg = []
for i in range(n):
    a = np.random.randint(1, 7, 10)
    avg.append(np.average(a))

# CHANGED: normalise this histogram too
# (density=True replaces normed=True, which newer Matplotlib no longer accepts)
plt.hist(avg, 20, density=True)

zscore = stats.zscore(avg)

mu, sigma = np.mean(avg), np.std(avg)
s = np.random.normal(mu, sigma, 10000)

# Create the bins and histogram
count, bins, ignored = plt.hist(s, 20, density=True)

# Use scipy.stats implementation of the normal pdf
# Plot the distribution curve
x = np.linspace(1.5, 5.5, num=100)
plt.plot(x, stats.norm.pdf(x, mu, sigma))

Which gives me the following plot:

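Edit

In the comments you asked:

1. How did I choose 1.5 and 5.5 in np.linspace?
2. Is it possible to plot the normal kernel over the un-normalized histogram?

To address q1: at first I chose 1.5 and 5.5 by eye. After plotting the histogram, I saw that the histogram bins ranged from about 1.5 to 5.5, which is the range over which we want to plot the normal distribution.

A more programmatic way of choosing this range would be: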

x = np.linspace(bins.min(), bins.max(), num=100)

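For q2, yes, we can achieve what you want. However, you should know that we would no longer be plotting a probability density function.

After removing the normed=True argument (density=True in current Matplotlib) when plotting the histograms: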

x = np.linspace(bins.min(), bins.max(), num=100)

# Find pdf of normal kernel at mu
max_density = stats.norm.pdf(mu, mu, sigma)
# Calculate how to scale pdf
scale = count.max() / max_density

plt.plot(x, scale * stats.norm.pdf(x, mu, sigma))

This gives me the following plot:
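The code above scales the pdf so that its peak matches the tallest bar. A different scaling, sketched here as an alternative and not as what the answer does, multiplies the pdf by the number of samples times the bin width, which is the expected count per bin under the fitted normal. The seed and the names bin_width and expected_counts are assumptions for illustration, and np.histogram stands in for plt.hist.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
avg = rng.integers(1, 7, size=(10_000, 10)).mean(axis=1)
mu, sigma = avg.mean(), avg.std()

# Un-normalized histogram: raw counts per bin
count, bins = np.histogram(avg, bins=20)
bin_width = bins[1] - bins[0]

x = np.linspace(bins.min(), bins.max(), num=100)
# Expected count in a bin centred at x: N * bin_width * pdf(x)
expected_counts = len(avg) * bin_width * stats.norm.pdf(x, mu, sigma)
```

With this scaling the area under the curve is roughly the same as the total area of the count bars, so the curve is not tied to whichever single bin happens to be tallest.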

Answer 3

The logic seems correct.

The problem is in how the data is displayed.

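Try normalizing the first histogram with normed=True (density=True in current Matplotlib), and use the same bins for both histograms, for example 20 bins.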
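One way to get identical bins for both histograms is to compute a single set of bin edges and reuse it. This is a minimal sketch using np.histogram_bin_edges. The seed and the names edges, avg_density and s_density are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
avg = rng.integers(1, 7, size=(10_000, 10)).mean(axis=1)
s = rng.normal(avg.mean(), avg.std(), size=10_000)

# One set of 20 bins (21 edges) spanning both samples
edges = np.histogram_bin_edges(np.concatenate([avg, s]), bins=20)

# Both histograms are normalized and share exactly the same bins
avg_density, _ = np.histogram(avg, bins=edges, density=True)
s_density, _ = np.histogram(s, bins=edges, density=True)
```

The same edges array can be passed to plt.hist as the bins argument, together with density=True, so that the two histograms are directly comparable bar by bar.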

Answer 4

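Rolling a die is a case of a discrete uniform distribution: each number from 1 to 6 comes up with probability 1/6. The mean and standard deviation of a single roll are therefore given by

mean = (1 + 6) / 2 = 3.5
std = sqrt((6^2 - 1) / 12) ≈ 1.7078

Now, the CLT says that for a sufficiently large n (10 in the code), the pdf of the mean of n rolls approaches a normal distribution with mean 3.5 and standard deviation 1.7078 / sqrt(10).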

import math
import numpy as np
import scipy.stats as stats

# `avg` is the list of 10-roll averages built by the simulation code in the question
n_bins = 50
pdf_from_hist, bin_edges = np.histogram(np.array(avg), bins=n_bins, density=True)
bin_mid_pts = np.add(bin_edges[:-1], bin_edges[1:]) * 0.5
assert len(pdf_from_hist) == len(bin_mid_pts)
expected_std = 1.7078 / math.sqrt(10)
expected_mean = 3.5
pk_s = []
qk_s = []
for i in range(n_bins):
    p = stats.norm.pdf(bin_mid_pts[i], expected_mean, expected_std)
    q = pdf_from_hist[i]
    if q <= 1.0e-5:
        # skip (nearly) empty bins
        continue
    pk_s.append(p)
    qk_s.append(q)
# compute the KL divergence (stats.entropy normalises both inputs to sum to 1)
kl_div = stats.entropy(pk_s, qk_s)
print('the pdf of the mean of the 10 throws differs from the corresponding normal dist with a kl divergence of %r' % kl_div)
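The mean and standard deviation predicted by the CLT can also be checked directly against a simulation. The sketch below is an illustration under assumed settings (a fixed seed and 100000 repetitions, neither of which comes from the original code), with die_mean, die_std and n_rolls as illustrative names.

```python
import math
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
n_rolls = 10
# 100000 simulated averages of n_rolls die rolls each
avg = rng.integers(1, 7, size=(100_000, n_rolls)).mean(axis=1)

# Theoretical moments of a single fair die (discrete uniform on 1..6)
die_mean = (1 + 6) / 2
die_std = math.sqrt((6 ** 2 - 1) / 12)

# CLT prediction for the standard deviation of the average of n_rolls rolls
expected_std = die_std / math.sqrt(n_rolls)

print(die_mean, round(die_std, 4), round(expected_std, 4))
print(avg.mean(), avg.std())
```

The simulated mean and standard deviation should land close to the theoretical values, with small differences from sampling noise.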

How to plot a normal distribution curve and the central limit theorem
