I am using Python with matplotlib and need to visualize the percentage distribution of the subgroups of a data set.
Imagine this tree:
Data --- group1 (40%)
     -
     --- group2 (25%)
     -
     --- group3 (35%)

group1 --- A (25%)
       -
       --- B (25%)
       -
       --- C (50%)
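It can go on: each group can have several subgroups, and the same holds for every subgroup.
How can I plot a suitable chart for this information?
Answer 0 (score: 1)
I created a minimal reproducible example that I think fits your description, but let me know if it is not what you need.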
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
data = pd.DataFrame()
n_rows = 100
data['group'] = np.random.choice(['1', '2', '3'], n_rows)
data['subgroup'] = np.random.choice(['A', 'B', 'C'], n_rows)
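For example, we could get the following counts for the subgroups (the data is random, so the exact numbers vary between runs).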
In [1]: data.groupby(['group'])['subgroup'].value_counts()
Out[1]:
group  subgroup
1      A           17
       C           16
       B            5
2      A           23
       C           10
       B            7
3      C            8
       A            7
       B            7
Name: subgroup, dtype: int64
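To turn counts like these into the percentages the question asks about, one option is to pass normalize=True to value_counts. The following is only a minimal sketch: the fixed seed and the use of np.random.default_rng are assumptions added here for repeatability and are not part of the original answer.

```python
import numpy as np
import pandas as pd

# Assumption: a fixed seed, used here only so the sketch is repeatable.
rng = np.random.default_rng(0)
n_rows = 100
data = pd.DataFrame({
    'group': rng.choice(['1', '2', '3'], n_rows),
    'subgroup': rng.choice(['A', 'B', 'C'], n_rows),
})

# Share of each group in the whole data set.
group_share = data['group'].value_counts(normalize=True).sort_index()

# Share of each subgroup within its parent group.
subgroup_share = (
    data.groupby('group')['subgroup']
        .value_counts(normalize=True)
        .sort_index()
)

print(group_share)
print(subgroup_share)
```

The group shares add up to 1 over the whole data set, and the subgroup shares add up to 1 inside each group, which is the same parent-relative convention as the tree in the question.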
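I created a function that, given an ordering of the columns (e.g. ['group', 'subgroup']), computes the necessary counts and plots the bars level by level with the corresponding percentages.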
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.patches import Patch

def plot_tree(data, ordering, axis=False):
    """
    Plots a sequence of stacked bars reflecting how the data
    is distributed at different levels. The order of the
    levels is given by the ordering parameter.

    Parameters
    ----------
    data: pandas DataFrame
    ordering: list
        Names of the columns to be plotted. They should be
        ordered top down, from the larger to the smaller group.
    axis: boolean
        Whether to plot the axis.

    Returns
    -------
    fig: matplotlib figure object.
        The final tree plot.
    """
    # Frame set-up
    fig, ax = plt.subplots(figsize=(9.2, 3*len(ordering)))
    ax.set_xticks(np.arange(-1, len(ordering)) + 0.5)
    ax.set_xticklabels(['All'] + ordering, fontsize=18)
    if not axis:
        ax.axis('off')

    # Get colormap: one color per label
    labels = ['All']
    for o in reversed(ordering):
        labels.extend(data[o].unique().tolist())
    # Pastel is nice but has few colors. Change for a larger map if needed.
    # plt.get_cmap is used because matplotlib.cm.get_cmap was removed in Matplotlib 3.9.
    cmap = plt.get_cmap('Pastel1', len(labels))
    colors = dict(zip(labels, [cmap(i) for i in range(len(labels))]))

    # Group the counts: one row per observed combination of levels
    counts = data.groupby(ordering).size().reset_index(name='c_' + ordering[-1])
    for i, o in enumerate(ordering[:-1], 1):
        # Count of a coarser level = sum of the finest-level counts below it
        counts['c_' + o] = counts.groupby(ordering[:i])['c_' + ordering[-1]].transform('sum')

    # Calculate percentages relative to the parent level
    counts['p_' + ordering[0]] = counts['c_' + ordering[0]]/data.shape[0]
    for i, o in enumerate(ordering[1:], 1):
        counts['p_' + o] = counts['c_' + o]/counts['c_' + ordering[i-1]]

    # Plot first bar - all data
    ax.bar(-1, data.shape[0], width=1, label='All', color=colors['All'], align="edge")
    ax.annotate('All -- 100%', (-0.9, 0.5), fontsize=12)

    for bar, col in enumerate(ordering):
        # Get only the relevant counts at this level
        local_counts = counts[ordering[:bar+1] +
                              ['c_' + o for o in ordering[:bar+1]] +
                              ['p_' + o for o in ordering[:bar+1]]].drop_duplicates()
        # Take the labels from the rows themselves so they stay aligned
        # with the counts even when some combination is absent from the data
        labels = local_counts[col]
        sizes = local_counts['c_' + col]
        percs = local_counts['p_' + col]
        bottom = 0  # start from 0
        for size, perc, label in zip(sizes, percs, labels):
            ax.bar(bar, size, width=1, bottom=bottom, label=label, color=colors[label], align="edge")
            ax.annotate('{} -- {:.0%}'.format(label, perc), (bar+0.1, bottom+0.5), fontsize=12)
            bottom += size  # stack the bars

    # One legend entry per label, matched to its color
    ax.legend(handles=[Patch(color=c, label=l) for l, c in colors.items()])
    return fig
With the data shown above, we get the following.
fig = plot_tree(data, ['group', 'subgroup'], axis=True)
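The count and percentage step inside plot_tree can also be checked on its own, without drawing anything. Below is a minimal sketch under stated assumptions: level_percentages is a hypothetical helper name introduced here for illustration, and the toy data is constructed to mirror the tree in the question (group2 and group3 get arbitrary subgroups, since the question does not specify them).

```python
import pandas as pd

def level_percentages(data, ordering):
    """Hypothetical helper: counts and parent-relative shares per level."""
    last = ordering[-1]
    # One row per observed combination, with the finest-level count.
    counts = data.groupby(ordering).size().reset_index(name='c_' + last)
    # Count at each coarser level = sum of the finest-level counts below it.
    for i, col in enumerate(ordering[:-1], 1):
        counts['c_' + col] = (
            counts.groupby(ordering[:i])['c_' + last].transform('sum')
        )
    # Share relative to the parent (the whole data set for the first level).
    counts['p_' + ordering[0]] = counts['c_' + ordering[0]] / len(data)
    for i, col in enumerate(ordering[1:], 1):
        counts['p_' + col] = counts['c_' + col] / counts['c_' + ordering[i - 1]]
    return counts

# Toy data (an assumption for illustration): 40/25/35 rows per group,
# with group1 split into A/B/C as 25%/25%/50%.
toy = pd.DataFrame({
    'group': ['group1'] * 40 + ['group2'] * 25 + ['group3'] * 35,
    'subgroup': ['A'] * 10 + ['B'] * 10 + ['C'] * 20 + ['A'] * 25 + ['B'] * 35,
})
table = level_percentages(toy, ['group', 'subgroup'])
print(table)
```

Each p_ column is a share of the parent level rather than of the whole data set, which matches how the tree in the question is annotated (C is 50% of group1, not 50% of all rows).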
Answer 1 (score: 0)