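Is there a way to do this? I can't seem to find an easy way to connect a pandas Series with plotting a CDF.
Answer 0 (score: 57)
I believe the functionality you are looking for is in the hist method of a Series object, which wraps the hist() function in matplotlib.
Here is the relevant documentation: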
In [10]: import matplotlib.pyplot as plt
In [11]: plt.hist?
...
Plot a histogram.
Compute and draw the histogram of *x*. The return value is a
tuple (*n*, *bins*, *patches*) or ([*n0*, *n1*, ...], *bins*,
[*patches0*, *patches1*,...]) if the input contains multiple
data.
...
cumulative : boolean, optional, default : False
If `True`, then a histogram is computed where each bin gives the
counts in that bin plus all bins for smaller values. The last bin
gives the total number of datapoints. If `normed` is also `True`
then the histogram is normalized such that the last bin equals 1.
If `cumulative` evaluates to less than 0 (e.g., -1), the direction
of accumulation is reversed. In this case, if `normed` is also
`True`, then the histogram is normalized such that the first bin
equals 1.
...
For example:
In [12]: import pandas as pd
In [13]: import numpy as np
In [14]: ser = pd.Series(np.random.normal(size=1000))
In [15]: ser.hist(cumulative=True, density=1, bins=100)
Out[15]: <matplotlib.axes.AxesSubplot at 0x11469a590>
In [16]: plt.show()
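To make the `cumulative` description above concrete, here is a minimal sketch of the same arithmetic done by hand with `np.histogram`. It is not matplotlib's internal code; the seeded generator and the variable names are assumptions made only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is reproducible
ser = pd.Series(rng.normal(size=1000))

# Per-bin counts and bin edges, as a plain (non-cumulative) histogram would give.
counts, edges = np.histogram(ser, bins=100)

# Cumulative and normalized: each bin holds the fraction of points in that bin
# or in any bin to its left, so the last bin equals 1.
cdf = np.cumsum(counts) / counts.sum()

# Reversed accumulation (what a negative `cumulative` describes):
# each bin holds the fraction in that bin or to its right, so the first bin equals 1.
reversed_cdf = np.cumsum(counts[::-1])[::-1] / counts.sum()
```

The array `cdf` should match the bar heights drawn by the cumulative, normalized histogram for the same data and bins.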
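Answer 1 (score: 12)
A CDF, or cumulative distribution function, plot is basically a graph with the sorted values on the X-axis and the cumulative distribution on the Y-axis. So I would create a new series with the sorted values as the index and the cumulative distribution as the values.
First, create an example series: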
import pandas as pd
import numpy as np
ser = pd.Series(np.random.normal(size=100))
Sort the series:
ser = ser.sort_values()
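Now, before proceeding, append the last (and largest) value again. This step is important, especially for small sample sizes, in order to get an unbiased CDF: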
ser[len(ser)] = ser.iloc[-1]
Create a new series with the sorted values as the index and the cumulative distribution as the values:
cum_dist = np.linspace(0.,1.,len(ser))
ser_cdf = pd.Series(cum_dist, index=ser)
Finally, plot the function as steps:
ser_cdf.plot(drawstyle='steps')
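The steps above can be collected into one helper. This is only a sketch: the name `cdf_series` is hypothetical, and it builds the sorted values with NumPy rather than by label assignment, but the resulting series follows the same recipe.

```python
import numpy as np
import pandas as pd

def cdf_series(ser):
    """Return a Series with sorted values as index and cumulative distribution as values."""
    values = np.sort(ser.to_numpy())
    values = np.append(values, values[-1])  # repeat the largest value, as in the steps above
    cum_dist = np.linspace(0.0, 1.0, len(values))
    return pd.Series(cum_dist, index=values)

# Tiny example: index is [1, 2, 3, 3], values are [0, 1/3, 2/3, 1].
example = cdf_series(pd.Series([3.0, 1.0, 2.0]))
```

The result can be drawn in the same way as before, for instance with `example.plot(drawstyle='steps')`.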
Answer 2 (score: 8)
This is the simplest way to do it.
import pandas as pd
df = pd.Series([i for i in range(100)])
df.hist( cumulative = True )
Answer 3 (score: 6)
It can be done like this:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
series = pd.Series(np.random.normal(size=10000))
fig, ax = plt.subplots()
ax2 = ax.twinx()
n, bins, patches = ax.hist(series, bins=100, density=False)
n, bins, patches = ax2.hist(
series, cumulative=1, histtype='step', bins=100, color='tab:orange')
plt.savefig('test.png')
If you want to remove the vertical line, how to do that is explained here. Or you can simply do:
ax.set_xlim((ax.get_xlim()[0], series.max()))
I also saw an elegant solution here on how to do this with seaborn.
Answer 4 (score: 4)
In case you are also interested in the values, not just the plot.
import pandas as pd
# If you are in jupyter
%matplotlib inline
# Define your series
s = pd.Series([9, 5, 3, 5, 5, 4, 6, 5, 5, 8, 7], name = 'value')
df = pd.DataFrame(s)
# Get the frequency, PDF and CDF for each value in the series
# Frequency
stats_df = df \
.groupby('value') \
['value'] \
.agg('count') \
.pipe(pd.DataFrame) \
.rename(columns = {'value': 'frequency'})
# PDF
stats_df['pdf'] = stats_df['frequency'] / sum(stats_df['frequency'])
# CDF
stats_df['cdf'] = stats_df['pdf'].cumsum()
stats_df = stats_df.reset_index()
stats_df
# Plot the discrete Probability Mass Function and CDF.
# Technically, the 'pdf' label in the legend and the table should be 'pmf'
# (Probability Mass Function) since the distribution is discrete.
# If you don't have too many values / usually discrete case
stats_df.plot.bar(x = 'value', y = ['pdf', 'cdf'], grid = True)
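Alternative example with a sample drawn from a continuous distribution, or when you have a lot of individual values:
import numpy as np
# Define your series
s = pd.Series(np.random.normal(loc = 10, scale = 0.1, size = 1000), name = 'value')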
# ... all the same calculation stuff to get the frequency, PDF, CDF
# Plot
stats_df.plot(x = 'value', y = ['pdf', 'cdf'], grid = True)
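Note that if it is reasonable to assume that each value occurs only once in the sample (typically the case with continuous distributions), then the groupby() + agg('count') step is not necessary, since the count is always 1.
In that case, a percent rank can be used to get to the cdf directly.
Use your best judgment when taking this kind of shortcut! :)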
# Define your series
s = pd.Series(np.random.normal(loc = 10, scale = 0.1, size = 1000), name = 'value')
df = pd.DataFrame(s)
# Get to the CDF directly
df['cdf'] = df['value'].rank(method = 'average', pct = True)
# Sort and plot
df.sort_values('value').plot(x = 'value', y = 'cdf', grid = True)
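Why the shortcut needs judgment can be seen on the discrete series from the first example. The sketch below (variable names are assumptions) compares the frequency-table CDF with percent ranks: with ties, `method='max'` reproduces P(X <= x), while `method='average'` gives a smaller number for the tied values.

```python
import pandas as pd

s = pd.Series([9, 5, 3, 5, 5, 4, 6, 5, 5, 8, 7], name='value')

# CDF from the frequency table: cumulative counts divided by the sample size.
freq = s.value_counts().sort_index()
cdf_from_counts = freq.cumsum() / freq.sum()

# Percent ranks of every observation under two tie-breaking rules.
rank_max = s.rank(method='max', pct=True)      # tied values get their highest rank
rank_avg = s.rank(method='average', pct=True)  # tied values get their mean rank

# For the value 5 (five occurrences out of 11 observations):
# cdf_from_counts.loc[5] and rank_max are 7/11, whereas rank_avg is 5/11.
```

When every value is unique the two methods agree, which is the situation the shortcut above relies on.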
Answer 5 (score: 2)
To me, this seemed like a simple way to do it:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
heights = pd.Series(np.random.normal(size=100))
# empirical CDF
def F(x,data):
    return float(len(data[data <= x]))/len(data)
vF = np.vectorize(F, excluded=['data'])
plt.plot(np.sort(heights),vF(x=np.sort(heights), data=heights))
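`np.vectorize` calls `F` once per point, so the approach above scans the data for every x. A possible alternative, sketched here with assumed variable names and a seeded sample, gets the same values from one sort plus `np.searchsorted`:

```python
import numpy as np
import pandas as pd

heights = pd.Series(np.random.default_rng(0).normal(size=100))

sorted_heights = np.sort(heights.to_numpy())

# side='right' returns, for each x, the number of samples <= x;
# dividing by the sample size gives the empirical CDF at x.
ecdf_values = np.searchsorted(sorted_heights, sorted_heights, side='right') / len(sorted_heights)
```

Plotting `sorted_heights` against `ecdf_values` should give the same curve as the vectorized function.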
Answer 6 (score: 0)
I found another solution in "pure" pandas that does not require specifying the number of bins to use in a histogram:
import pandas as pd
import numpy as np # used only to create example data
series = pd.Series(np.random.normal(size=10000))
cdf = series.value_counts().sort_index().cumsum()
cdf.plot()
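As written, the snippet above accumulates raw counts, so the curve ends at the number of samples rather than at 1. If a curve scaled to the range 0 to 1 is wanted, one option (a sketch using the `normalize` parameter of `value_counts` and a seeded sample) is:

```python
import numpy as np  # used only to create example data
import pandas as pd

series = pd.Series(np.random.default_rng(0).normal(size=10000))

# normalize=True turns counts into relative frequencies,
# so the running sum over the sorted values ends at 1.
cdf = series.value_counts(normalize=True).sort_index().cumsum()
```

Calling `cdf.plot()` on this result then puts probabilities on the Y-axis.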
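Answer 7 (score: 0)
If you want to plot a "true" empirical CDF, which jumps exactly at the values of your data set a, and where the jump at each value is proportional to the frequency of that value, NumPy has built-in functions that do the job: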
import matplotlib.pyplot as plt
import numpy as np
def ecdf(a):
    x, counts = np.unique(a, return_counts=True)
    y = np.cumsum(counts)
    x = np.insert(x, 0, x[0])
    y = np.insert(y/y[-1], 0, 0.)
    plt.plot(x, y, drawstyle='steps-post')
    plt.grid(True)
    plt.savefig('ecdf.png')
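The call to unique() returns the data values in sorted order along with their corresponding frequencies. The drawstyle='steps-post' option in the plot() call ensures that the jumps occur where they should. To force a jump at the smallest data value, the code inserts an additional element at the front of x and y.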
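To inspect the step coordinates without drawing anything, the same computation can be separated from the plotting. The helper name `ecdf_points` is hypothetical; it is a sketch that mirrors the function above minus the matplotlib calls.

```python
import numpy as np

def ecdf_points(a):
    """Return the (x, y) step coordinates of the empirical CDF of a."""
    x, counts = np.unique(a, return_counts=True)  # sorted unique values and their frequencies
    y = np.cumsum(counts) / counts.sum()          # cumulative fraction at each unique value
    x = np.insert(x, 0, x[0])                     # extra point so the curve starts at 0 ...
    y = np.insert(y, 0, 0.0)                      # ... and jumps at the smallest value
    return x, y

x, y = ecdf_points(np.array([7, 1, 2, 2, 7, 4, 4, 4, 5.5, 7]))
# x is [1, 1, 2, 4, 5.5, 7] and y is [0, 0.1, 0.3, 0.6, 0.7, 1.0]
```

Each jump height equals the relative frequency of that value; for example, 4 occurs three times out of ten, and y rises from 0.3 to 0.6 there.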
Example usage:
xvec = np.array([7,1,2,2,7,4,4,4,5.5,7])
ecdf(xvec)
Another usage:
import pandas as pd
df = pd.DataFrame({'x':[7,1,2,2,7,4,4,4,5.5,7]})
ecdf(df['x'])