Question

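I'm trying to make a nested boxplot like the one in this SO-answer for matplotlib, but I can't figure out how to create my dataframe.

The aim is to do a kind of sensitivity analysis of a PCA model that represents object positions (in 3D): I want to see how well the PCA model can represent an arch-like distribution as a function of the number of PCA components I use.

So I have an array of shape (n_pca_components, n_samples, n_objects) containing the distances of the objects to their 'ideal' position on the arch. What I am able to boxplot is this (example shown with random data): This is, I assume, an aggregated boxplot (statistics collected over the first two axes of the array). I want to create a boxplot with the same x- and y-axes, but for each 'obj_..' I want one box for each value along the first axis of my data (n_pca_components), i.e. something like this (where the days correspond to the 'obj_i's, 'total_bill' corresponds to my stored distances, and 'smoker' corresponds to each entry along the first axis of the array).

I've read around, but I got lost in pandas concepts such as multi-indexing, groupby, (un)stack, reset_index, ... All the examples I see have a different data structure, and I think that's where the problem lies: I haven't made the mental 'click' yet and am thinking in the wrong data structures.

What I have so far is (using random/example data):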

import numpy as np
import pandas as pd
import seaborn as sns

n_pca_components = 5  # Let's say I want to make this analysis for using 3, 6, 9, 12, 15 PCA components
n_objects = 14   # 14 objects per sample
n_samples = 100  # 100 samples

# Create random data
mses = np.random.rand(n_pca_components, n_samples, n_objects)   # Simulated errors

# Create column names
n_comps = [f'{(i+1) * 3}' for i in range(n_pca_components)]
object_ids = [f'obj_{i}' for i in range(n_objects)]
samples = [f'sample_{i}' for i in range(n_samples)]

# Create pandas DataFrame (one row per (n_comps, sample) pair, one column per object)
mses_pd = mses.reshape(-1, n_objects)
midx = pd.MultiIndex.from_product([n_comps, samples], names=['n_comps', 'samples'])

mses_frame = pd.DataFrame(data=mses_pd, index=midx, columns=object_ids)

# Make a nested boxplot with `object_ids` on the 'large' X-axis and `n_comps` on each 'nested' X-axis; and the box-statistics about the mses stored in `mses_frame` on the y-axis.

# Things I tried (yes, I'm a complete pandas-newbie). I've been reading a lot of SO-posts and documentation but cannot seem to figure out how to do what I want.
sns.boxplot(data=mses_frame, hue='n_comps')  # ValueError: Cannot use `hue` without `x` and `y`
sns.boxplot(data=mses_frame, hue='n_comps', x='object_ids') # ValueError: Could not interpret input 'object_ids'
sns.boxplot(data=mses_frame, hue='n_comps', x=object_ids) # ValueError: Could not interpret input 'n_comps'
sns.boxplot(data=mses_frame, hue=n_comps, x=object_ids) # ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Answer 1

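Is this what you want?

While I think seaborn can handle wide data, I personally find it easier to work with "tidy data" (or long data). To convert the dataframe from "wide" to "long" you can use DataFrame.melt, making sure to keep your index.

So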

>>> mses_frame.melt(ignore_index=False)

                  variable     value
n_comps samples
3       sample_0     obj_0  0.424960
        sample_1     obj_0  0.758884
        sample_2     obj_0  0.408663
        sample_3     obj_0  0.440811
        sample_4     obj_0  0.112798
...                    ...       ...
15      sample_95   obj_13  0.172044
        sample_96   obj_13  0.381045
        sample_97   obj_13  0.364024
        sample_98   obj_13  0.737742
        sample_99   obj_13  0.762252

[7000 rows x 2 columns]
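
To see what melt does on a frame small enough to inspect by hand, here is a minimal sketch. The tiny frame and the column names object_id and mse are illustrative assumptions, not part of the original data; var_name and value_name are optional melt parameters that replace the default 'variable' and 'value' names.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for mses_frame: 2 n_comps values x 3 samples, 2 objects.
n_comps = ["3", "6"]
samples = [f"sample_{i}" for i in range(3)]
object_ids = ["obj_0", "obj_1"]

midx = pd.MultiIndex.from_product([n_comps, samples], names=["n_comps", "samples"])
wide_frame = pd.DataFrame(
    np.arange(12, dtype=float).reshape(6, 2), index=midx, columns=object_ids
)

# ignore_index=False keeps the (n_comps, samples) index on every melted row.
# 'object_id' and 'mse' are hypothetical names chosen for this illustration.
long_frame = wide_frame.melt(ignore_index=False, var_name="object_id", value_name="mse")

print(long_frame.shape)  # (12, 2): one row per (n_comps, sample, object) combination
```

Each of the 6 wide rows is repeated once per object column, so the index values appear twice in the long frame.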

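Again, seaborn can probably use this somehow (maybe someone else can comment on that), but I find it easier to reset the index so that your multi-index levels become columns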

>>> mses_frame.melt(ignore_index=False).reset_index()

     n_comps    samples variable     value
0          3   sample_0    obj_0  0.424960
1          3   sample_1    obj_0  0.758884
2          3   sample_2    obj_0  0.408663
3          3   sample_3    obj_0  0.440811
4          3   sample_4    obj_0  0.112798
...      ...        ...      ...       ...
6995      15  sample_95   obj_13  0.172044
6996      15  sample_96   obj_13  0.381045
6997      15  sample_97   obj_13  0.364024
6998      15  sample_98   obj_13  0.737742
6999      15  sample_99   obj_13  0.762252

[7000 rows x 4 columns]
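
As an alternative to melt followed by reset_index, the same kind of long table could probably be built straight from the 3D array, without going through the wide frame first. This is only a sketch: the column names object_id and mse and the seeded random generator are assumptions made for illustration, and the row order differs from the melted frame (which does not matter for a boxplot).

```python
import numpy as np
import pandas as pd

n_pca_components, n_samples, n_objects = 5, 100, 14
rng = np.random.default_rng(0)  # seeded only to make the sketch reproducible
mses = rng.random((n_pca_components, n_samples, n_objects))

n_comps = [f"{(i + 1) * 3}" for i in range(n_pca_components)]
samples = [f"sample_{i}" for i in range(n_samples)]
object_ids = [f"obj_{i}" for i in range(n_objects)]

# from_product varies the last level fastest, which matches the C-order
# flattening done by ravel(), so labels and values line up element by element.
idx = pd.MultiIndex.from_product(
    [n_comps, samples, object_ids], names=["n_comps", "samples", "object_id"]
)
long_frame = pd.Series(mses.ravel(), index=idx, name="mse").reset_index()
```

The design choice here is to label every element of the array with one index level per axis; reset_index then turns those three levels into ordinary columns that seaborn can refer to by name.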

Now you can decide what you want to plot; I think you are saying you want

sns.boxplot(x="variable", y="value", hue="n_comps", 
            data=mses_frame.melt(ignore_index=False).reset_index())
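
Put together as one self-contained script, the steps above might look like the following sketch. The Agg backend, the figure size, and the axis labels are assumptions added for illustration; the data is random, as in the question.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

n_pca_components, n_samples, n_objects = 5, 100, 14
mses = np.random.rand(n_pca_components, n_samples, n_objects)  # simulated errors

n_comps = [f"{(i + 1) * 3}" for i in range(n_pca_components)]
samples = [f"sample_{i}" for i in range(n_samples)]
object_ids = [f"obj_{i}" for i in range(n_objects)]

# Wide frame: rows indexed by (n_comps, samples), one column per object.
midx = pd.MultiIndex.from_product([n_comps, samples], names=["n_comps", "samples"])
mses_frame = pd.DataFrame(mses.reshape(-1, n_objects), index=midx, columns=object_ids)

# Long frame: columns n_comps, samples, variable (object id), value (distance).
long_frame = mses_frame.melt(ignore_index=False).reset_index()

# One group per object on the x-axis, one box per n_comps value inside each group.
fig, ax = plt.subplots(figsize=(14, 5))
sns.boxplot(data=long_frame, x="variable", y="value", hue="n_comps", ax=ax)
ax.set_xlabel("object")
ax.set_ylabel("distance to ideal position")
```

To view the result, call fig.savefig with a filename of your choice, or drop the Agg line and use plt.show() in an interactive session.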

Let me know if I've misunderstood something.

How to make a nested/grouped boxplot with seaborn for a 3D array
