Matplotlib dot plot with two categorical variables

Question

I would like to produce a specific type of visualization consisting of a rather simple dot plot, but with a twist: both axes are categorical variables (i.e. ordinal or non-numeric values). This complicates matters instead of making them easier.

To illustrate the question, I will use a small example dataset that is a modification of seaborn.load_dataset("tips"), defined as follows:

import pandas
from io import StringIO
df = """total_bill | tip | sex | smoker | day | time | size
16.99 | 1.01 | Male | No | Mon | Dinner | 2
10.34 | 1.66 | Male | No | Sun | Dinner | 3
21.01 | 3.50 | Male | No | Sun | Dinner | 3
23.68 | 3.31 | Male | No | Sun | Dinner | 2
24.59 | 3.61 | Female | No | Sun | Dinner | 4
25.29 | 4.71 | Female | No | Mon | Lunch | 4
8.77 | 2.00 | Female | No | Tue | Lunch | 2
26.88 | 3.12 | Male | No | Wed | Lunch | 4
15.04 | 3.96 | Male | No | Sat | Lunch | 2
14.78 | 3.23 | Male | No | Sun | Lunch | 2"""
df = pandas.read_csv(StringIO(df.replace(' ','')), sep="|", header=0)

My first approach to producing the graph was to try a call to seaborn like this:

import seaborn
axes = seaborn.pointplot(x="time", y="sex", data=df)

This fails with:

ValueError: Neither the `x` nor `y` variable appears to be numeric.

The equivalent seaborn.stripplot and seaborn.swarmplot calls fail in the same way. It does work, however, if one of the variables is categorical and the other is numeric. Indeed, seaborn.pointplot(x="total_bill", y="sex", data=df) works, but it is not what I want.

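I also attempted a scatter plot like this:

axes = seaborn.scatterplot(x="time", y="sex", size="day", data=df,
                           x_jitter=True, y_jitter=True)

This produces the following graph, which does not contain any jitter and has all the dots overlapping, making it useless:

Do you know of any elegant approach or library that could solve my problem?

I started writing something myself, but that implementation is suboptimal and limited by the number of points that can overlap at the same spot (it currently fails if more than 4 points overlap).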

Answer 1

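You can first convert time and sex to categorical type, then tweak the coordinates a little:

import numpy as np
import pandas as pd
import seaborn as sns

df.sex = pd.Categorical(df.sex)
df.time = pd.Categorical(df.time)

# jitter the integer category codes by a small random amount
axes = sns.scatterplot(x=df.time.cat.codes+np.random.uniform(-0.1,0.1, len(df)),
                       y=df.sex.cat.codes+np.random.uniform(-0.1,0.1, len(df)),
                       size=df.tip)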

Output:
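One side effect of plotting cat.codes is that the axes show integer codes (0, 1) instead of the category names. The following is a minimal sketch, not part of the original answer, of one way to restore the labels and make the jitter reproducible. It assumes a small stand-in DataFrame with the same column names as the question's df, uses plain Matplotlib rather than seaborn, and the seed, jitter width and marker scale are arbitrary choices.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Stand-in data (hypothetical; same column names as the question's df).
df = pd.DataFrame({
    "time": ["Dinner", "Dinner", "Lunch", "Lunch", "Dinner", "Lunch"],
    "sex": ["Male", "Female", "Female", "Male", "Male", "Male"],
    "tip": [1.01, 3.61, 4.71, 3.12, 3.50, 3.96],
})
df["time"] = pd.Categorical(df["time"])
df["sex"] = pd.Categorical(df["sex"])

# Seeded generator so the jitter is identical on every run.
rng = np.random.default_rng(0)
x = df["time"].cat.codes + rng.uniform(-0.1, 0.1, len(df))
y = df["sex"].cat.codes + rng.uniform(-0.1, 0.1, len(df))

fig, ax = plt.subplots()
ax.scatter(x, y, s=df["tip"] * 30)  # marker area scaled by tip (arbitrary factor)

# One tick per category code, labelled with the category name.
ax.set_xticks(range(len(df["time"].cat.categories)))
ax.set_xticklabels(df["time"].cat.categories)
ax.set_yticks(range(len(df["sex"].cat.categories)))
ax.set_yticklabels(df["sex"].cat.categories)
ax.set_xlabel("time")
ax.set_ylabel("sex")
```

Calling plt.show() or fig.savefig(...) afterwards displays or stores the figure. The same tick calls should also work on the Axes object returned by sns.scatterplot.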

With that idea, you can replace the random offsets (np.random) in the code above with the appropriate distances. For example:

# group by both categorical variables
groups = df.groupby(['time', 'sex'], observed=True)

# compute the number of samples per group
num_samples = groups.tip.transform('size')

# enumerate the samples within each group and map each rank to an angle
sample_ranks = groups.cumcount() * (2*np.pi) / num_samples

# compute the offsets: points on a small circle, or no offset for a lone sample
x_offsets = np.where(num_samples.eq(1), 0, np.cos(sample_ranks) * 0.03)
y_offsets = np.where(num_samples.eq(1), 0, np.sin(sample_ranks) * 0.03)

# plot
axes = sns.scatterplot(x=df.time.cat.codes + x_offsets, 
                       y=df.sex.cat.codes + y_offsets,
                       size=df.tip)

Output:
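The circular layout is not tied to a fixed number of overlapping points, which is the limit the question runs into. As a rough sketch (the function name circular_offsets and its radius parameter are hypothetical, not part of seaborn or pandas), the same offsets can be computed with the standard library alone and then added to the category codes before plotting:

```python
import math
from collections import Counter, defaultdict


def circular_offsets(pairs, radius=0.03):
    """Return one (dx, dy) offset per (x_category, y_category) pair.

    Points sharing the same pair are spread evenly on a circle of the
    given radius; a pair that occurs only once keeps a zero offset.
    """
    totals = Counter(pairs)   # number of points that fall on each pair
    seen = defaultdict(int)   # number of points of each pair placed so far
    offsets = []
    for pair in pairs:
        n = totals[pair]
        if n == 1:
            offsets.append((0.0, 0.0))
        else:
            angle = 2 * math.pi * seen[pair] / n
            offsets.append((radius * math.cos(angle), radius * math.sin(angle)))
        seen[pair] += 1
    return offsets


# Example: three points on (Dinner, Male) and one on (Lunch, Female).
pairs = [("Dinner", "Male"), ("Dinner", "Male"),
         ("Lunch", "Female"), ("Dinner", "Male")]
offsets = circular_offsets(pairs, radius=0.1)
```

With the DataFrame from the answer, pairs could be built as list(zip(df.time, df.sex)). With a fixed radius, a large group will eventually crowd the circle, so the radius or the marker size may need to scale with the group size.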
