如何从python Seaborn Displot获取可能性分布函数

时间:2018-09-23 20:55:48

标签: python matplotlib seaborn distribution

我想在seaborn.displot()中获得kde fit提供的拟合度的可能性分布函数(PDF),或者当我有x=20时,如何获得曲线上的可能性值?

import matplotlib.pyplot as plt 
import numpy as np
import seaborn as sns

x=np.array([33,42,31,36,36,33, 37 ,37, 28 ,36 ,32, 40 ,43 ,37, 33 ,40 ,41 ,44, 53 ,38, 32, 48, 51, 37 ,29, 41 ,30 ,29 ,28, 40 ,35 ,33 ,33 ,29, 27 ,33, 35, 34, 28 ,35, 39 ,37 ,31 ,33 ,32 ,39 ,24, 30, 29, 21, 28, 28, 29, 29 ,25, 34, 24, 28 ,25, 25 ,27, 18, 27, 27, 35, 26, 29, 29, 30])

sns.distplot(x)

enter image description here

3 个答案:

答案 0 :(得分:1)

似乎没有直接的方法来返回distplot拟合的pdf,但是您可以按以下方式获取pdf行的数据并对其进行绘制,以确保得到相同的拟合

fig, axs = plt.subplots(1,2, figsize=(10,3))

x=np.array([33,42,31,36,36,33, 37 ,37, 28 ,36 ,32, 40 ,43 ,37, 33 ,40 ,41 ,44, 53 ,38, 32, 48, 51, 37 ,29, 41 ,30 ,29 ,28, 40 ,35 ,33 ,33 ,29, 27 ,33, 35, 34, 28 ,35, 39 ,37 ,31 ,33 ,32 ,39 ,24, 30, 29, 21, 28, 28, 29, 29 ,25, 34, 24, 28 ,25, 25 ,27, 18, 27, 27, 35, 26, 29, 29, 30])

ax1 = sns.distplot(x, ax=axs[0], label='KDE pdf')
fit = ax1.get_lines()[0].get_data() # Getting the data from the plotted line
xfit, yfit = fit[0], fit[1]
ax1.legend()

axs[1].plot(xfit, yfit, label='Extracted pdf')
axs[1].set_ylim(ax1.get_ylim())
plt.legend()

拟合不完全包含x=20,但是您可以使用一些公差值来获取最接近x=20的点

输出

enter image description here

答案 1 :(得分:0)

您可以获得用于绘制分布的数据(x和y值)。从中可以插值到两者之间的任何值。

如果要获得概率,则必须对pdf数据进行积分并计算该范围内的值。

import numpy as np
import seaborn as sns
import scipy

x=np.array([33,42,31,36,36,33, 37 ,37, 28 ,36 ,32, 40 ,43 ,37, 33 ,40 ,41 ,44, 53 ,38, 32, 48, 51, 37 ,29, 41 ,30 ,29 ,28, 40 ,35 ,33 ,33 ,29, 27 ,33, 35, 34, 28 ,35, 39 ,37 ,31 ,33 ,32 ,39 ,24, 30, 29, 21, 28, 28, 29, 29 ,25, 34, 24, 28 ,25, 25 ,27, 18, 27, 27, 35, 26, 29, 29, 30])
ax = sns.distplot(x) 

#Value to estimate for
value = 20

#Get the data from the KDE line
xdata, ydata = ax.get_lines()[0].get_data()

#Find the closest point on the curve
idx = (np.abs(xdata-value)).argmin()
#Interpolate to get a better estimate
p = np.interp(value,xdata[idx:idx+2],ydata[idx:idx+2])
print('Point on PDF for X = {} is: {}'.format(value,p))

#Plot the line
ax.vlines(value, 0, p ,colors='r')

#Find the probability for an interval of one (e.g. between 20 and 21)
cdf = scipy.integrate.cumtrapz(ydata, xdata, dx=1, initial=0)
pr = cdf[value+1] - cdf[value]
print('Probability of X <{},{}> is: {}'.format(value,value+1,pr))

# Fill the area 
plt.fill_between(xdata,ydata, where = (xdata>=value) & (xdata<=value+1), color='g')

输出应为:

Point on PDF for X = 20 is: 0.007789463075158201
Probability of X <20,21> is: 0.0015438476906999374

Output distplot

答案 2 :(得分:0)

这很好,但是在计算概率以达到正确目的时,您需要进行很小的校正。

plt.figure(figsize=(15,12))

ax = sns.distplot(comparaison['error_'])
#Value to estimate for
value = 20

#Get the data from the KDE line
xdata, ydata = ax.get_lines()[0].get_data()

#Find the closest point on the curve
idx = (np.abs(xdata-value)).argmin()
#Interpolate to get a better estimate
p = np.interp(value,xdata[idx:idx+2],ydata[idx:idx+2])
print('Point on PDF for X = {} is: {}'.format(value,p))
#Plot the line
ax.vlines(value, 0, p ,colors='r')

#Find the probability for an interval of one (e.g. between 20 and 100)
ecart = 80
idx = (np.abs(xdata-value)).argmin()
idx_ = (np.abs(xdata-(value+ecart))).argmin()

cdf = scipy.integrate.cumtrapz(ydata, xdata, dx=1, initial=0)
pr = cdf[idx_] - cdf[idx] # Error here see old code, need to define idx_ 
print('Probability of X <{},{}> is: {}'.format(value,value+ecart,pr))

# Fill the area 
plt.fill_between(xdata,ydata, where = (xdata>=value) & (xdata<=value+ecart), color='g')