Boxplot overlaid with a distribution/histogram

Time: 2019-10-08 10:35:12

Tags: python pandas matplotlib seaborn

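In short: I am trying to draw a boxplot against a continuous variable, where each box represents the entries that fall within a certain range of that variable. On top of that, I would like to overlay a histogram to show the distribution (how many counts fall in each interval).

More explanation: I have a dataframe with several columns. I want to group the rows by intervals of a continuous column X and draw a boxplot of how another column (value_x in the example below) looks for each interval of X. In addition, I would like to overlay a distribution or histogram that shows roughly how many elements each box contains.

I have tried (and failed) to draw the histogram as a bar plot of counts. I then classified the data by the middle value of each bin so that I could draw both the histogram and the boxplot on the same axes, or on axes sharing the same x-axis (axes.twinx()), but the histogram gets distorted. It is as if the x-axis values of the two plots were not recognized as being the same.

Original histogram: https://imgur.com/8iNnR2u

After trying to add the boxplot: https://imgur.com/0LRGTTp

Here is an illustrative example of what I have been trying to do: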

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


# Generate random data
prices = np.random.uniform(low=-85.0,high=85.0, size=(50,))
value_x = np.random.uniform(low=0,high=3000.0, size=(50,))
df = pd.DataFrame({'price':prices,'value_x':value_x })
# Classify each entry according to the bin they belong to
df['interval_index'] = np.digitize(df['price'], np.arange(-85,85,5))
# Get middle value for each bin, for example, if bin is (40-45), middle value would be 42.5
df['interval_middle_value'] = df['interval_index']*5-87.5

# Failed attempt to generate the desired plot
fig, ax = plt.subplots()
sns.distplot(df['price'],bins=np.arange(-85,85,5), ax=ax, kde=False, norm_hist=False)
ax2=ax.twinx()
sns.boxplot(x='interval_middle_value',y='value_x',data=df, ax=ax2)
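
As a quick sanity check of the binning arithmetic in the snippet above, here is a minimal sketch of how np.digitize indices map to bin midpoints. The sample prices are made-up values chosen for illustration, not data from the original post.

```python
import numpy as np

# Bin edges as in the question: -85, -80, ..., 80 (np.arange excludes the stop value).
edges = np.arange(-85, 85, 5)
step = 5

# Hand-picked sample prices (illustrative assumption, not the original data).
sample_prices = np.array([-85.0, -82.0, -0.1, 0.0, 42.0, 84.9])

# np.digitize returns i such that edges[i-1] <= x < edges[i];
# values at or above the last edge get i == len(edges).
interval_index = np.digitize(sample_prices, edges)

# Midpoint of the bin that starts at edges[i-1]; for these edges this is
# the same as the question's formula interval_index * 5 - 87.5.
interval_middle = edges[0] + (interval_index - 0.5) * step

print(interval_index)
print(interval_middle)
```

So the midpoints themselves come out as intended (for example, 42.0 maps to 42.5); the distortion described above is not caused by this step.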

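I would like to get a result like the one shown in this figure: https://imgur.com/svEOlNQ

1 Answer:

Answer 0: (score: 1)

Because a boxplot is categorical, you need to set the positions of the boxes to the middle of the bin intervals. So you are probably looking for something like this: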

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Generate random data
prices = np.random.uniform(low=-85.0,high=85.0, size=(500,))
value_x = np.random.uniform(low=0,high=3000.0, size=(500,))
df = pd.DataFrame({'price':prices,'value_x':value_x })
# Classify each entry according to the bin they belong to
bins = np.arange(-85,90,5)

width = np.diff(bins)[0]
df['interval_index'] = np.digitize(df['price'], bins)
# Get middle value for each bin, for example, if bin is (40-45), middle value would be 42.5
middle_value = bins[:-1] + width/2

# Plot the histogram, then overlay the boxplots on a twin y-axis
fig, ax = plt.subplots()
ax.hist(df['price'].values, bins=bins)
ax2=ax.twinx()

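# Collect the value_x samples of each bin; np.digitize numbers the bins from 1
stats = [df['value_x'][df['interval_index'] == i].values for i in range(1, len(bins))]
# manage_ticks=False (matplotlib >= 3.1) leaves the histogram's numeric x ticks untouched
ax2.boxplot(stats, positions=middle_value, widths=width*0.6, manage_ticks=False)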


plt.show()
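
The grouping step above can also be written with pandas alone. The following is a minimal sketch, not part of the original answer: the helper name boxplot_groups and its parameters are hypothetical, and it assumes half-open bins [left, right) to match the default behaviour of np.digitize. Unlike the list comprehension above, it skips empty bins.

```python
import numpy as np
import pandas as pd


def boxplot_groups(df, bins, by="price", value="value_x"):
    """Return (positions, groups) for the non-empty bins of df[by].

    Hypothetical helper for illustration; names are not from the original answer.
    """
    bins = np.asarray(bins, dtype=float)
    # right=False gives half-open bins [left, right), like np.digitize's default.
    binned = pd.cut(df[by], bins=bins, right=False)
    positions, groups = [], []
    # observed=True iterates only over bins that actually contain rows.
    for interval, values in df.groupby(binned, observed=True)[value]:
        positions.append(interval.mid)    # middle of the bin
        groups.append(values.to_numpy())  # samples to summarise in one box
    return positions, groups


# Random demo data shaped like the example above (seeded for repeatability).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "price": rng.uniform(-85.0, 85.0, size=500),
    "value_x": rng.uniform(0.0, 3000.0, size=500),
})
positions, groups = boxplot_groups(demo, np.arange(-85, 90, 5))
```

The two lists can then be passed to ax2.boxplot(groups, positions=positions, widths=width*0.6, manage_ticks=False) in place of stats and middle_value.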
