Stacked bar chart from grouped counts with pandas in Python

Time: 2016-11-24 10:17:27

Tags: python pandas numpy matplotlib plot

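My CSV data looks like the sample below. I want to create a stacked bar chart with pandas/Python in which each bar is split into two colors, one for the male and one for the female share, with the total number of men and women taking the drug (in my case) shown on top of the bar. For example, for age 20 there are 7 people in total, 6 male and 1 female, so the bar should show 7 on top and the 6:1 split inside the bar in two colors. I managed to group people by age and plot the counts, but I want the bars to show both genders in different colors. Any help would be appreciated. Thanks.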

Drug_ID,Age,Gender
FF90E5F780133BBCAEE9BA0E4654C5CA,15,F
2AB2529352E6A1D0BA91E167C5191231,17,M
6E4B2C80EA83C5B51675BC077A182E06,19,M
8AEB19A68DE4C301154973E0C87C774D,19,M
FA533E7DD1FCAB83822D4A2FC514AEE7,20,F
208F39485A8FBA7214CA004B53CC29B7,20,M
3172B95E8A5732D2EDB089A354787612,20,M
38FBAE52AAB04E56AB3A35C4AFBF3813,20,M
62D8A33130574C984EAF3A2E80C243A6,20,M
646AF35E192885FE1336649BA3735FC4,20,M
778D2B1354F30ED3D3BDD7B9437CF670,20,M
891312933FE5FB25703B3E958EF943E3,21,F
5CC63DFC5EF399F65CB9BC583A770DFB,21,M
97C9ED5750EC6FF618A059085F0BF4A6,21,M
C6F03C5E3CC37249E0308A09292F5A58,21,M
0713C427BDC3A90F291AF49599987F38,23,F
518EBD35FCDAB7C744334F993D8CEB28,23,F
A3FC185C4531AFF1E8B7B9F5985028CB,23,F
1837406921314CB182FB0C7BC5565204,23,M
CA0E6274BD39E8DE5C823F4E6F234252,23,M
E2F64030BB011C11541EDC192BAED09A,23,M
EF8F3D028C5759858AB7574864833015,23,M
5DCD427F26E05CC1C3F565BB05EAE10B,24,F
8646ED503722C3C6C6B44208EF1A5202,24,F
F9F45112C472282778E1F437F54B0B70,24,F
0CF4F20B89303CB4C03CF3BD3B27CAF0,24,M
63727039DFF77A46995DA8EDBC4E3497,24,M

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

data = pd.read_csv('/home/MedResearch/Desktop/drug_ids_age_gender.csv')
data.values

df = pd.DataFrame(data)
df2 = pd.merge(df1,df,  left_index = True, right_index = True)
temp1 = df2.groupby('Age').Age.count()

df3 = pd.merge(df1,df,  left_index = True, right_index = True)
temp2 = df3.groupby('Gender').Age.count()

ax = temp1.plot(kind='bar',stacked = False, colormap = 'Paired')
for p in ax.patches:
    ax.annotate(np.round(p.get_height(),decimals=0), (p.get_x()+p.get_width()/2.,   p.get_height()), ha='center', va='center', xytext=(2, 10), textcoords='offset points')
plt.ylabel('Numbers Of Patients Taking the drug')
plt.title('Age Distribution Of Patients Based Upon drug Intake')
plt.show()

This produces the following plot:

[image: bar chart of the total patient count per age, with no gender split]

1 answer:

Answer 0 (score: 5)

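This question comes up often, so I decided to write a step-by-step explanation. Note that I'm not a pandas guru, so some of this could probably be optimized.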

I start by generating the list of ages that I will use for the x-axis:

cvsdata = '''Drug_ID,Age,Gender
FF90E5F780133BBCAEE9BA0E4654C5CA,15,F
2AB2529352E6A1D0BA91E167C5191231,17,M
6E4B2C80EA83C5B51675BC077A182E06,19,M
8AEB19A68DE4C301154973E0C87C774D,19,M
FA533E7DD1FCAB83822D4A2FC514AEE7,20,F
208F39485A8FBA7214CA004B53CC29B7,20,M
3172B95E8A5732D2EDB089A354787612,20,M
38FBAE52AAB04E56AB3A35C4AFBF3813,20,M
62D8A33130574C984EAF3A2E80C243A6,20,M
646AF35E192885FE1336649BA3735FC4,20,M
778D2B1354F30ED3D3BDD7B9437CF670,20,M
891312933FE5FB25703B3E958EF943E3,21,F
5CC63DFC5EF399F65CB9BC583A770DFB,21,M
97C9ED5750EC6FF618A059085F0BF4A6,21,M
C6F03C5E3CC37249E0308A09292F5A58,21,M
0713C427BDC3A90F291AF49599987F38,23,F
518EBD35FCDAB7C744334F993D8CEB28,23,F
A3FC185C4531AFF1E8B7B9F5985028CB,23,F
1837406921314CB182FB0C7BC5565204,23,M
CA0E6274BD39E8DE5C823F4E6F234252,23,M
E2F64030BB011C11541EDC192BAED09A,23,M
EF8F3D028C5759858AB7574864833015,23,M
5DCD427F26E05CC1C3F565BB05EAE10B,24,F
8646ED503722C3C6C6B44208EF1A5202,24,F
F9F45112C472282778E1F437F54B0B70,24,F
0CF4F20B89303CB4C03CF3BD3B27CAF0,24,M
63727039DFF77A46995DA8EDBC4E3497,24,M'''

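from io import StringIO  # Python 3; the Python 2 spelling was StringIO.StringIO
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv(StringIO(cvsdata))
ages = df.Age.unique()  # distinct ages, in order of first appearance
print(ages)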

array([15, 17, 19, 20, 21, 23, 24])

Then I build a grouped data frame containing the count of M and F for each age:

counts = df.groupby(['Age','Gender']).count()
print(counts)

            Drug_ID
Age Gender         
15  F             1
17  M             1
19  M             2
20  F             1
    M             6
21  F             1
    M             3
23  F             3
    M             4
24  F             3
    M             2

From this, I can easily compute the total number of individuals in each age group:

# counts.sum(level=0) was removed in pandas 2.0; group on the first index level instead
totals = counts.groupby(level=0).sum()
print(totals)

     Drug_ID
Age         
15         1
17         1
19         2
20         7
21         4
23         7
24         5

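To prepare for plotting, I reshape the counts data frame so that each gender is a column rather than an index level. Here I also drop the 'Drug_ID' column level, because the unstack() operation leaves the columns as a MultiIndex, and a data frame is much easier to work with when its columns are not a MultiIndex.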

counts = counts.unstack(level=1)                    # Gender moves from the index to the columns
counts.columns = counts.columns.droplevel(level=0)  # drop the 'Drug_ID' level
print(counts)

Gender    F    M
Age             
15      1.0  NaN
17      NaN  1.0
19      NaN  2.0
20      1.0  6.0
21      1.0  3.0
23      3.0  4.0
24      3.0  2.0

That looks pretty good. As a final touch, I replace the NaN values with 0.

counts = counts.fillna(0)
print(counts)

Gender    F    M
Age             
15      1.0  0.0
17      0.0  1.0
19      0.0  2.0
20      1.0  6.0
21      1.0  3.0
23      3.0  4.0
24      3.0  2.0
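The grouping, unstacking and NaN-filling steps above can also be collapsed into one expression. The following is a minimal sketch, not the answer's original method: it assumes a data frame with `Age` and `Gender` columns, and it rebuilds the sample's age/gender distribution inline (the `rows` list is an illustrative stand-in for the CSV, with `Drug_ID` omitted). Using `size()` with `unstack(fill_value=0)` keeps the counts as integers instead of floats.

```python
import pandas as pd

# Same Age/Gender distribution as the sample CSV (Drug_ID omitted for brevity).
rows = ([(15, "F"), (17, "M")] + [(19, "M")] * 2
        + [(20, "F")] + [(20, "M")] * 6
        + [(21, "F")] + [(21, "M")] * 3
        + [(23, "F")] * 3 + [(23, "M")] * 4
        + [(24, "F")] * 3 + [(24, "M")] * 2)
df = pd.DataFrame(rows, columns=["Age", "Gender"])

# size() counts rows per (Age, Gender); unstack() pivots Gender into columns,
# and fill_value=0 fills in the missing age/gender combinations with 0.
counts = df.groupby(["Age", "Gender"]).size().unstack(fill_value=0)

# Row sums give the total number of people per age.
totals = counts.sum(axis=1)

# pd.crosstab builds the same Age x Gender table in a single call.
crosstab = pd.crosstab(df["Age"], df["Gender"])

print(counts)
print(totals)
```

Here `totals` is a Series rather than a one-column data frame, so it can be iterated directly without flattening.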

With this data frame, drawing the stacked bar chart is straightforward:

plt.bar(ages, counts['M'], bottom=None, color='blue', label='M')
plt.bar(ages, counts['F'], bottom=counts['M'], color='pink', label='F')
plt.legend()
plt.xlabel('Ages')
plt.ylabel('Count')

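To draw the total count above each bar, we use the annotate() function. This cannot be done in a single call, so we loop over ages and totals together. (For simplicity I take totals.values and flatten() it: totals is a one-column data frame, so its values form a 2-D array of shape (7, 1) rather than a flat sequence of numbers.)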

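for age,tot in zip(ages,totals.values.flatten()):
    # anchor the label at the top center of each bar, then shift it up by 5 points
    plt.annotate('N={:d}'.format(tot), xy=(age, tot), xytext=(0,5), textcoords='offset points', ha='center', va='bottom')

The annotation is anchored at (age, tot). Since matplotlib 2.0, bars are centered on x by default (align='center'), so age is the middle of the bar and tot is, of course, its full height. With the older default, bars ran from x to x+width with width=0.8, so the middle of the bar was x+0.4 and the anchor had to be (age+0.4, tot). To move the text slightly away from the bar, I offset it by a few (5) points in the y direction. Adjust to taste.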

See the documentation for bar() to adjust the bar parameters, and the documentation for annotate() to customize the annotations.

[image: the resulting stacked bar chart, with the total count above each bar]
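Putting the steps together, here is one possible end-to-end sketch. The helper name `plot_stacked_counts` and the inline `rows` list are illustrative assumptions, not part of the original answer. It also assumes that both 'M' and 'F' occur in the data, and that matplotlib 2.0 or later is in use, where bars are centered on x, so each label is anchored at `age` rather than `age+0.4`.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd


def plot_stacked_counts(df):
    """Hypothetical helper: stacked M/F bars per age with the total on top."""
    counts = df.groupby(["Age", "Gender"]).size().unstack(fill_value=0)
    totals = counts.sum(axis=1)
    ages = counts.index.to_numpy()

    fig, ax = plt.subplots()
    ax.bar(ages, counts["M"], color="blue", label="M")
    ax.bar(ages, counts["F"], bottom=counts["M"], color="pink", label="F")

    # One label per bar: anchored at the top center, shifted up by 5 points.
    for age, tot in zip(ages, totals):
        ax.annotate("N={:d}".format(tot), xy=(age, tot), xytext=(0, 5),
                    textcoords="offset points", ha="center", va="bottom")

    ax.legend()
    ax.set_xlabel("Ages")
    ax.set_ylabel("Count")
    return ax


# Same Age/Gender distribution as the sample CSV (Drug_ID omitted for brevity).
rows = ([(15, "F"), (17, "M")] + [(19, "M")] * 2
        + [(20, "F")] + [(20, "M")] * 6
        + [(21, "F")] + [(21, "M")] * 3
        + [(23, "F")] * 3 + [(23, "M")] * 4
        + [(24, "F")] * 3 + [(24, "M")] * 2)
df = pd.DataFrame(rows, columns=["Age", "Gender"])

ax = plot_stacked_counts(df)
labels = [t.get_text() for t in ax.texts]
print(labels)
```

Returning the Axes object leaves it to the caller to decide whether to show the figure, save it, or style it further.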