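Matplotlib and Seaborn box plots draw the box over the interquartile range, from the 25th to the 75th percentile. Is there a way to give a boxplot a custom range instead? I need a box plot whose box spans the 10th to the 90th percentile. Searching Google and other sources, I found how to set custom whiskers on a box plot, but not a custom interquartile range. I hope someone here can suggest a workable solution.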
Answer 0: (score: 1)
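Yes, it is possible to draw a box plot with the box edges at any percentiles you like.
However, first consider why plotting Q1 and Q3 is matplotlib's default behavior: drawing the box at the 25th and 75th percentiles is the graphical convention. Changing those percentiles therefore risks misleading your readers. You should also think carefully about what changing the box percentiles means for the classification of outliers and for the whiskers of the box plot.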
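To make the point about whiskers and outliers concrete, here is a minimal sketch that uses only NumPy. It assumes the usual rule that whiskers reach at most 1.5 times the box height beyond each box edge. The helper name fence_summary is hypothetical and not part of matplotlib.

```python
import numpy as np


def fence_summary(x, lower_pct, upper_pct, whis=1.5):
    """Return box edges, whisker limits, and the count of points beyond them.

    Hypothetical helper for illustration; it mimics the common
    "edge +/- whis * box height" rule, not matplotlib's exact code path.
    """
    lo, hi = np.percentile(x, [lower_pct, upper_pct])
    spread = hi - lo  # box height (the IQR when the box is 25/75)
    lo_fence = lo - whis * spread
    hi_fence = hi + whis * spread
    n_out = int(np.sum((x < lo_fence) | (x > hi_fence)))
    return {"box": (lo, hi), "fences": (lo_fence, hi_fence), "n_outliers": n_out}


rng = np.random.default_rng(2019)
x = rng.normal(size=1000)

conventional = fence_summary(x, 25, 75)
wide = fence_summary(x, 10, 90)

print("25/75 box:", conventional)
print("10/90 box:", wide)
```

Because the 10/90 box is taller, the same multiplier pushes the whisker limits much further out, so fewer points (possibly none) are classified as outliers.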
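There is no convenient way to change the percentiles drawn by plt.boxplot. However, looking at matplotlib's source code, we find that matplotlib uses matplotlib.cbook.boxplot_stats to compute the statistics for the box. In boxplot_stats we find the line q1, med, q3 = np.percentile(x, [25, 50, 75]). This is the line we can alter to change the plotted percentiles. With this approach we can also adjust the whiskers and so on, if we feel that is needed.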
import itertools
from matplotlib.cbook import _reshape_2D
import matplotlib.pyplot as plt
import numpy as np

# Adapted from matplotlib.cbook
def my_boxplot_stats(X, whis=1.5, bootstrap=None, labels=None,
                     autorange=False, percents=(25, 75)):

    def _bootstrap_median(data, N=5000):
        # determine 95% confidence intervals of the median
        M = len(data)
        percentiles = [2.5, 97.5]
        bs_index = np.random.randint(M, size=(N, M))
        bsData = data[bs_index]
        estimate = np.median(bsData, axis=1, overwrite_input=True)
        CI = np.percentile(estimate, percentiles)
        return CI

    def _compute_conf_interval(data, med, iqr, bootstrap):
        if bootstrap is not None:
            # Do a bootstrap estimate of notch locations.
            # get conf. intervals around median
            CI = _bootstrap_median(data, N=bootstrap)
            notch_min = CI[0]
            notch_max = CI[1]
        else:
            N = len(data)
            notch_min = med - 1.57 * iqr / np.sqrt(N)
            notch_max = med + 1.57 * iqr / np.sqrt(N)
        return notch_min, notch_max

    # output is a list of dicts
    bxpstats = []
    # convert X to a list of lists
    X = _reshape_2D(X, "X")
    ncols = len(X)
    if labels is None:
        labels = itertools.repeat(None)
    elif len(labels) != ncols:
        raise ValueError("Dimensions of labels and X must be compatible")
    input_whis = whis
    for ii, (x, label) in enumerate(zip(X, labels)):
        # empty dict
        stats = {}
        if label is not None:
            stats['label'] = label
        # restore whis to the input values in case it got changed in the loop
        whis = input_whis
        # note tricksyness, append up here and then mutate below
        bxpstats.append(stats)
        # if empty, bail
        if len(x) == 0:
            stats['fliers'] = np.array([])
            stats['mean'] = np.nan
            stats['med'] = np.nan
            stats['q1'] = np.nan
            stats['q3'] = np.nan
            stats['cilo'] = np.nan
            stats['cihi'] = np.nan
            stats['whislo'] = np.nan
            stats['whishi'] = np.nan
            stats['iqr'] = np.nan
            continue
        # up-convert to an array, just to be safe
        x = np.asarray(x)
        # arithmetic mean
        stats['mean'] = np.mean(x)
        # median
        med = np.percentile(x, 50)
        ## Altered line: box edges come from `percents` instead of fixed 25/75
        q1, q3 = np.percentile(x, (percents[0], percents[1]))
        # height of the box (the interquartile range only when percents is 25/75)
        stats['iqr'] = q3 - q1
        if stats['iqr'] == 0 and autorange:
            whis = 'range'
        # conf. interval around median
        stats['cilo'], stats['cihi'] = _compute_conf_interval(
            x, med, stats['iqr'], bootstrap
        )
        # lowest/highest non-outliers
        if np.isscalar(whis):
            if np.isreal(whis):
                loval = q1 - whis * stats['iqr']
                hival = q3 + whis * stats['iqr']
            elif whis in ['range', 'limit', 'limits', 'min/max']:
                loval = np.min(x)
                hival = np.max(x)
            else:
                raise ValueError('whis must be a float, valid string, or list '
                                 'of percentiles')
        else:
            loval = np.percentile(x, whis[0])
            hival = np.percentile(x, whis[1])
        # get high extreme
        wiskhi = np.compress(x <= hival, x)
        if len(wiskhi) == 0 or np.max(wiskhi) < q3:
            stats['whishi'] = q3
        else:
            stats['whishi'] = np.max(wiskhi)
        # get low extreme
        wisklo = np.compress(x >= loval, x)
        if len(wisklo) == 0 or np.min(wisklo) > q1:
            stats['whislo'] = q1
        else:
            stats['whislo'] = np.min(wisklo)
        # compute a single array of outliers
        stats['fliers'] = np.hstack([
            np.compress(x < stats['whislo'], x),
            np.compress(x > stats['whishi'], x)
        ])
        # add in the remaining stats
        stats['q1'], stats['med'], stats['q3'] = q1, med, q3
    return bxpstats
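With this in place, we can compute the statistics and then plot them with the Axes method bxp (ax.bxp). Below, I generate box plots whose box edges are at the (1, 99), (10, 90), and (25, 75) percentiles: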
# data
np.random.seed(2019)
data = np.random.normal(size=100)
stats = {}
# compute the boxplot stats
stats['A'] = my_boxplot_stats(data, labels='A', bootstrap=10000, percents=[1, 99])
stats['B'] = my_boxplot_stats(data, labels='B', bootstrap=10000, percents=[10, 90])
stats['C'] = my_boxplot_stats(data, labels='C', bootstrap=10000, percents=[25, 75])
fig, ax = plt.subplots(1, 1)
ax.bxp([stats['A'][0], stats['B'][0], stats['C'][0]], positions=np.r_[:3])
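If you just want a quick fix (ignoring any issues with the whiskers and the outlier classification), then instead of defining the function my_boxplot_stats you can do this: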
import matplotlib.cbook as cbook
import matplotlib.pyplot as plt
import numpy as np
# data
np.random.seed(2019)
data = np.random.normal(size=100)
stats = {}
# compute the boxplot stats
stats['A'] = cbook.boxplot_stats(data, labels='A', bootstrap=10000)
stats['B'] = cbook.boxplot_stats(data, labels='B', bootstrap=10000)
stats['C'] = cbook.boxplot_stats(data, labels='C', bootstrap=10000)
stats['A'][0]['q1'], stats['A'][0]['q3'] = np.percentile(data, [1, 99])
stats['B'][0]['q1'], stats['B'][0]['q3'] = np.percentile(data, [10, 90])
stats['C'][0]['q1'], stats['C'][0]['q3'] = np.percentile(data, [25, 75])
fig, ax = plt.subplots(1, 1)
ax.bxp([stats['A'][0], stats['B'][0], stats['C'][0]], positions=np.r_[:3])
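However, looking at the resulting plot, we see that changing the percentiles while leaving the whiskers unchanged produces odd-looking box plots for some percentiles. If you choose this "solution", you should be aware of that.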
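One possible way to keep the whiskers consistent with a custom box is to define them by percentiles as well and to recompute the fliers from them. The sketch below is an illustration under assumptions, not the method used above: the helper percentile_box_stats and its default percentiles (10/90 box, 1/99 whiskers) are hypothetical choices, and it relies only on NumPy. It returns a dict with the keys that Axes.bxp reads when notches are not drawn.

```python
import numpy as np


def percentile_box_stats(x, box=(10, 90), whiskers=(1, 99), label=None):
    """Build one stats dict (med, q1, q3, whislo, whishi, fliers) for Axes.bxp.

    Hypothetical helper, not part of matplotlib. Both the box edges and the
    whisker cut-offs are given as percentiles.
    """
    if not (0 <= whiskers[0] <= box[0] <= 50 <= box[1] <= whiskers[1] <= 100):
        raise ValueError("need whiskers[0] <= box[0] <= 50 <= box[1] <= whiskers[1]")

    x = np.asarray(x, dtype=float)
    q1, med, q3 = np.percentile(x, [box[0], 50, box[1]])
    lo_cut, hi_cut = np.percentile(x, whiskers)

    # Whiskers end on real data points inside the cut-offs,
    # but are never allowed to end inside the box.
    inside = x[(x >= lo_cut) & (x <= hi_cut)]
    whislo = min(inside.min(), q1) if inside.size else q1
    whishi = max(inside.max(), q3) if inside.size else q3

    # Everything beyond the whiskers is drawn as a flier.
    fliers = x[(x < whislo) | (x > whishi)]

    stats = {"med": med, "q1": q1, "q3": q3,
             "whislo": whislo, "whishi": whishi, "fliers": fliers}
    if label is not None:
        stats["label"] = label
    return stats


rng = np.random.default_rng(2019)
data = rng.normal(size=100)
stats = percentile_box_stats(data, box=(10, 90), whiskers=(1, 99), label="B")
print({k: v for k, v in stats.items() if k != "fliers"}, len(stats["fliers"]))
```

The returned dict can be passed to ax.bxp in a list, in the same way as the entries produced by cbook.boxplot_stats. Since the whiskers no longer follow the 1.5 times IQR rule, the plot caption should state which percentiles the box and whiskers represent.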