Python: fitting distributions, and comparing the fits across different data sizes

Time: 2018-07-10 14:44:51

Tags: python scipy statistics curve-fitting data-fitting

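I am trying to fit distributions to my data. I split the data into four subsets of different length (99.9%, 99.0%, 95.0% and 90.0% of the data). The fits are computed with SciPy's fit methods. My problem is that the shorter the data, the worse the fit results become, although I would expect fitting to be easier when the data cover a shorter interval. The fits are compared using R² and SSE.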
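The post does not show how the four subsets are produced. A minimal sketch of one way to do it, assuming the subsets are obtained by cutting the data at an upper percentile (the helper `truncate_at_percentile` and the synthetic `sample` are hypothetical and not part of the original code):

```python
import numpy as np


def truncate_at_percentile(data, upper_pct):
    """Keep only the values at or below the given upper percentile."""
    data = np.asarray(data, dtype=float)
    threshold = np.percentile(data, upper_pct)
    return data[data <= threshold], threshold


# Synthetic stand-in for the real measurements (hypothetical data).
rng = np.random.default_rng(0)
sample = rng.gamma(shape=2.0, scale=10.0, size=10_000)

subsets = {}
for pct in (99.9, 99.0, 95.0, 90.0):
    kept, threshold = truncate_at_percentile(sample, pct)
    subsets[pct] = kept
    print(f"{pct:5.1f}%: {kept.size} values, threshold = {threshold:.2f}")
```

Each subset is a truncated version of the same sample, so a distribution fitted to it is fitted to data whose upper tail has been cut off.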

Fit for 99.9% of the data:

Fitting

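import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
from scipy.stats import genextreme as gev  # assumed: the post does not show where gev comes from

# h4 holds the 99.9% data; binwidth and out_threshold4 are defined elsewhere (not shown in the post)
# density=True replaces normed=1, which newer Matplotlib versions no longer accept
n4, bins4, patches4 = plt.hist(h4, bins=binwidth, density=True, facecolor='#023d6b', alpha=0.5, histtype='bar')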

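# Evaluation grid for the fitted pdfs: from 0 up to the outlier threshold in steps of 0.5
lnspc = np.arange(0, int(out_threshold4) - 0.5, 0.5)

# Normal distribution: fit returns (loc, scale)
m, s = stats.norm.fit(h4)
pdf_g = stats.norm.pdf(lnspc, m, s)
#plt.plot(lnspc, pdf_g, label="Norm")

# Gamma distribution: fit returns (shape, loc, scale)
ag, bg, cg = stats.gamma.fit(h4)
pdf_gamma = stats.gamma.pdf(lnspc, ag, bg, cg)
plt.plot(lnspc, pdf_gamma, label="Gamma")

# Beta distribution: fit returns (a, b, loc, scale)
ab, bb, cb, db = stats.beta.fit(h4)
pdf_beta = stats.beta.pdf(lnspc, ab, bb, cb, db)
#plt.plot(lnspc, pdf_beta, label="Beta")

# Generalized extreme value distribution
gevfit = gev.fit(h4)
pdf_gev = gev.pdf(lnspc, *gevfit)
plt.plot(lnspc, pdf_gev, label="GEV")

# Log-normal distribution
logfit = stats.lognorm.fit(h4)
pdf_lognorm = stats.lognorm.pdf(lnspc, *logfit)
plt.plot(lnspc, pdf_lognorm, label="LogNormal")

# Weibull distribution; loc=0.1 is only the optimizer's starting guess, not a fixed value
weibfit = stats.weibull_min.fit(h4, loc=0.1)
pdf_weib = stats.weibull_min.pdf(lnspc, *weibfit)
plt.plot(lnspc, pdf_weib, label="Weibull")

# Burr type XII distribution, again with loc=0.1 as starting guess
burr12fit = stats.burr12.fit(h4, loc=0.1)
pdf_burr12 = stats.burr12.pdf(lnspc, *burr12fit)
plt.plot(lnspc, pdf_burr12, label="Burr")

# Generalized Pareto distribution
genparetofit = stats.genpareto.fit(h4)
pdf_genpareto = stats.genpareto.pdf(lnspc, *genparetofit)
plt.plot(lnspc, pdf_genpareto, label="Gen-Pareto")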

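# Gaussian mixture with 8 components. GaussianMixture replaces the GMM class,
# which has been removed from scikit-learn; max_iter corresponds to the old n_iter.
from sklearn.mixture import GaussianMixture

myarray = np.array(h4).reshape(-1, 1)  # scikit-learn expects shape (n_samples, n_features)

clf = GaussianMixture(n_components=8, max_iter=500, random_state=3)
clf = clf.fit(myarray)
lnspc = lnspc.reshape(-1, 1)
# score_samples returns the per-sample log-density, as the old GMM.score did
pdf_gmm = np.exp(clf.score_samples(lnspc))
plt.plot(lnspc, pdf_gmm, label="GMM")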

Calculation of R²

df4 = pd.DataFrame({'Strecke': bins4[:-1]+1, 'Propability': n4})

slope, intercept, r_value_norm4, p_value, std_err = stats.linregress(df4['Propability'],pdf_g)
#print ("R-squared Normal Distribution:", r_value_norm**2)

slope, intercept, r_value_gamma4, p_value, std_err = stats.linregress(df4['Propability'],pdf_gamma)
#print ("R-squared Gamma Distribution:", r_value_gamma**2)

slope, intercept, r_value_beta4, p_value, std_err = stats.linregress(df4['Propability'],pdf_beta)
#print ("R-squared Beta Distribution:", r_value_beta**2)

slope, intercept, r_value_gev4, p_value, std_err = stats.linregress(df4['Propability'],pdf_gev)
#print ("R-squared GEV Distribution:", r_value_gev**2)

slope, intercept, r_value_lognorm4, p_value, std_err = stats.linregress(df4['Propability'],pdf_lognorm)
#print ("R-squared LogNormal Distribution:", r_value_lognorm**2)

slope, intercept, r_value_weibull4, p_value, std_err = stats.linregress(df4['Propability'],pdf_weib)
#print ("R-squared Weibull Distribution:", r_value_weibull**2)

slope, intercept, r_value_burr12ull4, p_value, std_err = stats.linregress(df4['Propability'],pdf_burr12)

slope, intercept, r_value_genpareto4, p_value, std_err = stats.linregress(df4['Propability'],pdf_genpareto)

slope, intercept, r_value_gmm4, p_value, std_err = stats.linregress(df4['Propability'].values,pdf_gmm)
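`stats.linregress` needs two arrays of equal length, so the histogram heights and the pdf values have to refer to the same points. A minimal sketch of one way to guarantee that, assuming the pdf is evaluated at the bin centres (the synthetic `data`, the choice of 50 bins and the names `centres` and `pdf_at_centres` are hypothetical and not taken from the post):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
data = rng.gamma(shape=2.0, scale=10.0, size=5_000)  # hypothetical data

# Density-normalised histogram, the same quantities plt.hist(..., density=True) returns.
heights, edges = np.histogram(data, bins=50, density=True)
centres = 0.5 * (edges[:-1] + edges[1:])

# Fit on the raw data, then evaluate the pdf at the bin centres so that
# both arrays have one value per bin.
params = stats.gamma.fit(data)
pdf_at_centres = stats.gamma.pdf(centres, *params)

result = stats.linregress(heights, pdf_at_centres)
r_squared = result.rvalue ** 2
print("R-squared (gamma):", r_squared)
```

The R² obtained this way is a squared Pearson correlation, which does not change if the pdf is rescaled or shifted by a constant, so it does not penalise a curve that is systematically too high or too low.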

Calculation of SSE

# Accumulators start at zero (their initialization is not shown in the post)
s4rss_norm = s4rss_gamma = s4rss_beta = s4rss_gev = s4rss_lognorm = 0.0
s4rss_weib = s4rss_burr12 = s4rss_genpareto = s4rss_gmm = 0.0

# Note: indexing with j+1 means the first bin (index 0) is not included in the sums
for j in range(0, len(df4['Propability']) - 1):
    s4rss_norm += (df4['Propability'].iloc[j+1] - pdf_g[j+1])**2
    s4rss_gamma += (df4['Propability'].iloc[j+1] - pdf_gamma[j+1])**2
    s4rss_beta += (df4['Propability'].iloc[j+1] - pdf_beta[j+1])**2
    s4rss_gev += (df4['Propability'].iloc[j+1] - pdf_gev[j+1])**2
    s4rss_lognorm += (df4['Propability'].iloc[j+1] - pdf_lognorm[j+1])**2
    s4rss_weib += (df4['Propability'].iloc[j+1] - pdf_weib[j+1])**2
    s4rss_burr12 += (df4['Propability'].iloc[j+1] - pdf_burr12[j+1])**2
    s4rss_genpareto += (df4['Propability'].iloc[j+1] - pdf_genpareto[j+1])**2
    s4rss_gmm += (df4['Propability'].iloc[j+1] - pdf_gmm[j+1])**2
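The same sum of squared errors can be written without an explicit loop. A minimal sketch, assuming the pdfs are evaluated at the histogram bin centres and all bins are included (the synthetic `data`, the bin count and the dictionary names are hypothetical):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
data = rng.lognormal(mean=3.0, sigma=0.5, size=5_000)  # hypothetical data

heights, edges = np.histogram(data, bins=50, density=True)
centres = 0.5 * (edges[:-1] + edges[1:])

# One pdf per candidate distribution, evaluated at the bin centres.
fitted = {
    "norm": stats.norm.pdf(centres, *stats.norm.fit(data)),
    "lognorm": stats.lognorm.pdf(centres, *stats.lognorm.fit(data)),
}

# Sum of squared errors over all bins, one value per distribution.
sse = {name: float(np.sum((heights - pdf) ** 2)) for name, pdf in fitted.items()}
# Dividing by the number of bins gives a mean squared error, which is easier
# to compare between runs that use a different number of bins.
mse = {name: value / heights.size for name, value in sse.items()}
print(sse)
print(mse)
```

One thing to keep in mind when comparing SSE between subsets: a density-normalised histogram integrates to 1 over the retained range, so cutting off part of the data raises the bar heights, and SSE values from subsets of different width are not on exactly the same scale.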

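The calculations for the other data subsets, e.g. 90.0%, are done analogously.

The pictures below show the fits together with the resulting R² and SSE values.

My supervisor and I both expected the fit to be better for the 90% data, because it contains fewer outliers. Shouldn't the SSE also be smaller for the smaller data range, since fewer data points are checked? Are there other options for fitting these distributions? What am I missing?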
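On the question of other options: R² and SSE as computed above both depend on how the histogram is binned. A minimal sketch of a comparison that works on the raw data instead, using the log-likelihood (via AIC) and the Kolmogorov-Smirnov statistic. This is an illustration under stated assumptions, not the method used in the post; the synthetic `data` and the `candidates` and `results` dictionaries are hypothetical:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
data = rng.weibull(1.5, size=2_000) * 20.0  # hypothetical data

candidates = {
    "gamma": stats.gamma,
    "lognorm": stats.lognorm,
    "weibull_min": stats.weibull_min,
}

results = {}
for name, dist in candidates.items():
    params = dist.fit(data)
    # Log-likelihood of the raw data under the fitted distribution.
    log_lik = np.sum(dist.logpdf(data, *params))
    # AIC = 2k - 2 ln(L), with k the number of fitted parameters; lower is better.
    aic = 2 * len(params) - 2 * log_lik
    # KS statistic: largest distance between empirical and fitted cdf.
    ks_stat, _ = stats.kstest(data, name, args=params)
    results[name] = (aic, ks_stat)
    print(f"{name:12s} AIC = {aic:10.1f}  KS = {ks_stat:.4f}")
```

Only the KS statistic is reported here, because the usual KS p-value is not valid when the parameters were estimated from the same data. AIC values are also only comparable between distributions fitted to the same subset, not between the 99.9% and the 90.0% data.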

0 answers:

No answers