Question

I'm using Python with matplotlib, and I need to visualize the percentage distribution of the subgroups of a dataset.

Imagine this tree:

Data --- group1 (40%)
     -
     --- group2 (25%)
     -
     --- group3 (35%)


group1 --- A (25%)
       -
       --- B (25%)
       -
       --- C (50%)

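It can go on like this: each group can have several subgroups, and the same holds for every subgroup.

How can I plot a suitable chart for this information?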

Answer 1

I created a minimal reproducible example that I think matches your description, but let me know if it's not what you need.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
data = pd.DataFrame()
n_rows = 100
data['group'] = np.random.choice(['1', '2', '3'], n_rows)
data['subgroup'] = np.random.choice(['A', 'B', 'C'], n_rows)

For example, we can get the following counts for the subgroups.

In [1]: data.groupby(['group'])['subgroup'].value_counts()
Out[1]: group  subgroup
    1   A      17
        C      16
        B      5
    2   A      23
        C      10
        B      7
    3   C      8
        A      7
        B      7
 Name: subgroup, dtype: int64
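
Since the question asks for percentages rather than raw counts, the same call can return shares directly. The following is a minimal sketch, assuming a seeded stand-in for the `data` frame above (the seed and the names `group_share` and `subgroup_share` are only for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the `data` frame built above (seeded for repeatability)
rng = np.random.default_rng(0)
data = pd.DataFrame({
    'group': rng.choice(['1', '2', '3'], 100),
    'subgroup': rng.choice(['A', 'B', 'C'], 100),
})

# Share of each group in the whole data set
group_share = data['group'].value_counts(normalize=True)

# Share of each subgroup within its parent group
subgroup_share = data.groupby('group')['subgroup'].value_counts(normalize=True)

print(group_share)
print(subgroup_share)
```

The shares within each group sum to 1, which is the quantity the tree in the question describes.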

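I wrote a function that, given an ordering of the columns (e.g. ['group', 'subgroup']), computes the necessary counts and draws the bars level by level, annotated with the corresponding percentages.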

import matplotlib.pyplot as plt
import matplotlib.cm
import numpy as np

def plot_tree(data, ordering, axis=False):
    """
    Plots a sequence of bar plots reflecting how the data 
    is distributed at different levels. The order of the 
    levels is given by the ordering parameter.

    Parameters
    ----------
    data: pandas DataFrame
    ordering: list
        Names of the columns to be plotted. They should be
        ordered top down, from the larger to the smaller group.
    axis: boolean
        Whether to plot the axis.

    Returns
    -------
    fig: matplotlib figure object.
        The final tree plot.
    """

    # Frame set-up
    fig, ax = plt.subplots(figsize=(9.2, 3*len(ordering)))
    ax.set_xticks(np.arange(-1, len(ordering)) + 0.5)
    ax.set_xticklabels(['All'] + ordering, fontsize=18)
    if not axis:
        ax.axis('off')

    # Get colormap
    labels = ['All']
    for o in reversed(ordering):
        labels.extend(data[o].unique().tolist())
    # Pastel is nice but has few colors. Change for a larger map if needed
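    # matplotlib.cm.get_cmap was removed in Matplotlib 3.9; the colormap
    # registry with .resampled() (Matplotlib >= 3.6) replaces it
    cmap = matplotlib.colormaps['Pastel1'].resampled(len(labels))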
    colors = dict(zip(labels, [cmap(i) for i in range(len(labels))]))

    # Group the counts
    counts = data.groupby(ordering).size().reset_index(name='c_' + ordering[-1])
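    for i, o in enumerate(ordering[:-1], 1):
        # Total count of each higher-level group, broadcast to all of its rows.
        # Select the count column before transforming so that string columns
        # are never summed.
        counts['c_' + o] = counts.groupby(ordering[:i])['c_' + ordering[-1]].transform('sum')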
    # Calculate percentages
    counts['p_' + ordering[0]] = counts['c_' + ordering[0]]/data.shape[0]
    for i, o in enumerate(ordering[1:], 1):
        counts['p_' + o] = counts['c_' + o]/counts['c_' + ordering[i-1]]

    # Plot first bar - all data
    ax.bar(-1, data.shape[0], width=1, label='All', color=colors['All'], align="edge")
    ax.annotate('All -- 100%', (-0.9, 0.5), fontsize=12)
    for bar, col in enumerate(ordering):
        # Get only the relevant counts at this level
        local_counts = counts[ordering[:bar+1] +
                              ['c_' + o for o in ordering[:bar+1]] +
                              ['p_' + o for o in ordering[:bar+1]]].drop_duplicates()
        # Take the labels from the same rows as the counts so that they stay
        # aligned even when some group/subgroup combinations are missing
        labels = local_counts[col]
        sizes = local_counts['c_' + col]
        percs = local_counts['p_' + col]
        bottom = 0  # start stacking from 0
        for size, perc, label in zip(sizes, percs, labels):
            ax.bar(bar, size, width=1, bottom=bottom, label=label, color=colors[label], align="edge")
            ax.annotate('{} -- {:.0%}'.format(label, perc), (bar+0.1, bottom+0.5), fontsize=12)
            bottom += size  # stack the bars
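    # A label is drawn once per parent group, so keep one legend entry per label
    handles, legend_labels = ax.get_legend_handles_labels()
    by_label = dict(zip(legend_labels, handles))
    ax.legend(by_label.values(), by_label.keys())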
    return fig

With the data shown above, we get the following.

fig = plot_tree(data, ['group', 'subgroup'], axis=True)
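
To check the percentages that the annotations display, the counting step can be run on its own. The following is a sketch under stated assumptions: the helper name `level_percentages` and the hand-built data are hypothetical, chosen to reproduce the tree from the question (group1 40% with A 25%, B 25%, C 50%; group2 25%; group3 35%).

```python
import pandas as pd

# Hypothetical data reproducing the tree from the question
rows = (
    [('group1', 'A')] * 10 + [('group1', 'B')] * 10 + [('group1', 'C')] * 20
    + [('group2', 'A')] * 25 + [('group3', 'A')] * 35
)
data = pd.DataFrame(rows, columns=['group', 'subgroup'])


def level_percentages(data, ordering):
    """Return counts and within-parent shares for every level in `ordering`."""
    last = ordering[-1]
    counts = data.groupby(ordering).size().reset_index(name='c_' + last)
    # Count of each higher-level group, broadcast to all of its rows
    for i, col in enumerate(ordering[:-1], 1):
        counts['c_' + col] = counts.groupby(ordering[:i])['c_' + last].transform('sum')
    # Share of each level relative to its parent (the first level's parent is all data)
    parent = len(data)
    for col in ordering:
        counts['p_' + col] = counts['c_' + col] / parent
        parent = counts['c_' + col]
    return counts


table = level_percentages(data, ['group', 'subgroup'])
print(table)
```

Each `p_` column is a share relative to the parent level, not to the whole data set, which matches how the tree in the question is annotated.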

Answer 2

Have you tried a stacked bar chart?

https://matplotlib.org/gallery/lines_bars_and_markers/bar_stacked.html#sphx-glr-gallery-lines-bars-and-markers-bar-stacked-py
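
A minimal sketch of that idea, assuming a two-level data set with hypothetical `group` and `subgroup` columns like the one in the first answer: one bar per group, stacked by the share of each subgroup within that group.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical example data (seeded for repeatability)
rng = np.random.default_rng(0)
data = pd.DataFrame({
    'group': rng.choice(['1', '2', '3'], 100),
    'subgroup': rng.choice(['A', 'B', 'C'], 100),
})

# Rows: groups, columns: subgroups, values: share of the subgroup within its group
shares = pd.crosstab(data['group'], data['subgroup'], normalize='index')

fig, ax = plt.subplots()
bottom = np.zeros(len(shares))
for subgroup in shares.columns:
    # Draw this subgroup's segment on top of the segments drawn so far
    ax.bar(shares.index, shares[subgroup], bottom=bottom, label=subgroup)
    bottom += shares[subgroup].to_numpy()
ax.set_xlabel('group')
ax.set_ylabel('Share within group')
ax.legend(title='subgroup')
```

This covers only two levels. For deeper hierarchies, each additional level would need its own set of bars.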
