Including NA in a seaborn boxplot

Time: 2018-11-16 13:12:55

Tags: python seaborn

Can I show missing data as an extra factor level in a seaborn boxplot? I have been googling for a while.

This is the simple code I am using:

ax = sns.boxplot(data=df, x=x, y=y)

For value_counts there are options such as dropna:

df['bla'].value_counts(dropna = False)

But I can't find an equivalent for boxplot. Thanks.

1 answer:

Answer 0: (score: 1)

No, you can't. At least not directly with seaborn.

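Issues about NaN values have been opened against seaborn for lineplot and pairplot. However, a ticket from 2014 seems to indicate that seaborn has ignored missing values since version 0.4. This can be confirmed in seaborn's source code, in categorical.py:
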
box_data = remove_na(group_data)

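The best I can think of is to create an extra categorical column that records whether each value in the column of interest is valid or missing.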
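A minimal sketch of such a helper column, using made-up data and hypothetical column names (`group`, `value`, `group_is_na`, `group_filled`). It also shows a pandas-side workaround, which is my assumption rather than anything seaborn provides: if the NaNs sit in the grouping column itself, relabelling them gives them a category of their own.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'group' is the x variable and contains missing entries.
df = pd.DataFrame({
    "group": ["a", "b", np.nan, "a", np.nan, "b"],
    "value": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Indicator column: True where the grouping value is missing.
df["group_is_na"] = df["group"].isna()

# Extra column in which NaN becomes an explicit label.
# "missing" is an arbitrary name chosen for this sketch.
df["group_filled"] = df["group"].fillna("missing")

counts = df["group_filled"].value_counts()
print(counts)

# The relabelled column could then be passed to seaborn, for example:
# sns.boxplot(data=df, x="group_filled", y="value")
```

Since "missing" is then an ordinary string label, seaborn should draw it as one more box next to the real groups.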

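Then I would draw two subplots:
- a countplot showing the number of valid/invalid values in the column of interest
- a regular seaborn boxplot based on that column

In addition, the boxplot can be annotated to show the number of points taken into account for each box, and something similar can be done for a bar plot.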

Another approach is to take the value_counts information and add it to the plot as an annotation.
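A rough sketch of that idea, assuming made-up data and reusing the column names `category` and `col_3` purely for illustration. It counts valid and missing values per category with pandas and writes them into the tick labels. The explicit `order` list is my own addition so that the counts and the boxes line up.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up data with some injected missing values.
rng = np.random.default_rng(0)
order = ["titi", "tata", "toto"]
df = pd.DataFrame({
    "category": rng.choice(order, size=200),
    "col_3": rng.normal(size=200),
})
df.loc[df["col_3"] >= 0.5, "col_3"] = np.nan

# Per-category counts of missing and valid values, in plotting order.
missing = df["col_3"].isna().groupby(df["category"]).sum().reindex(order)
valid = df["col_3"].notna().groupby(df["category"]).sum().reindex(order)

fig, ax = plt.subplots()
sns.boxplot(data=df, x="category", y="col_3", order=order, ax=ax)

# Put the counts into the tick labels so every box reports its own NA count.
labels = [f"{cat}\nvalid: {valid[cat]}, NA: {missing[cat]}" for cat in order]
ax.set_xticks(range(len(order)))
ax.set_xticklabels(labels)
fig.savefig("boxplot_with_na_counts.png")
```

The boxes themselves still only use the non-missing values. The labels just make the dropped rows visible.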

Example:

import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

def custom(val):
    # Turn non-negative values into NaN (np.NaN was removed in NumPy 2.0)
    if val >= 0.0:
        return np.nan
    return val

df = pd.DataFrame(np.random.randn(500, 3))
df = df.rename(index=int, columns={0: 'col_1', 1: 'col_2', 2: 'col_3'})
df['four'] = 'bar'
df['five'] = df['col_1'] > 0
df['category'] = pd.cut(df['col_2'], bins=3, labels=['titi', 'tata', 'toto'])
df['col_3'] = df['col_1'].apply(custom)
df['is_col_3_na'] = pd.isna(df['col_3'])

fig, (ax1, ax2) = plt.subplots(1, 2)
validdf = df[~df['is_col_3_na']].copy()

sns.countplot(data=df, x='is_col_3_na', ax=ax1).set_title('col_3 valid/invalid data ratios')
sns.boxplot(data=validdf, x='category', y='col_3',
            #hue="category",
            ax=ax2)

print(df['is_col_3_na'].describe())
print(df['is_col_3_na'].value_counts())

# start: taken from https://python-graph-gallery.com/38-show-number-of-observation-on-boxplot/
# with proper modifications
# Calculate number of obs per group & median to position labels
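grouped = validdf.groupby('category', observed=False)['col_3']
medians = grouped.median().values
# Take the counts from the same groupby so they follow the category order of
# the medians (value_counts() sorts by frequency and can mislabel the boxes)
nobs = ["n: " + str(n) for n in grouped.count().values]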

# Add it to the plot
pos = range(len(nobs))
for tick, label in zip(pos, ax2.get_xticklabels()):
    ax2.text(pos[tick], medians[tick] + 0.03, nobs[tick],
             horizontalalignment='center', size='x-small', color='b', weight='semibold')
# end: taken from https://python-graph-gallery.com/38-show-number-of-observation-on-boxplot/
plt.show()

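Output:

[Figure: countplot of valid/invalid col_3 values next to a boxplot of col_3 per category, with the number of observations annotated on each box]

Console output (for the 'col_3' column):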

count      500
unique       2
top       True
freq       254
Name: is_col_3_na, dtype: object

True     254
False    246
Name: is_col_3_na, dtype: int64