Pandas histogram: extracting the column and group-by value from the data

Time: 2018-08-14 21:14:21

Tags: python pandas matplotlib

I have a DataFrame and am using the column and by arguments of the pandas hist() method to view histograms of subsets of the data, as follows:

ax = df.hist(column='activity_count', by='activity_month')

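(I then go on to plot this information.) I am trying to work out how to programmatically extract two pieces of data as I loop over the axes: the specific value of 'activity_month' for each axis, and the number of records with that value of 'activity_month':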

for i, x in enumerate(ax):
    # a (the month for this axis) and b (its row count) are what I want to obtain
    print("the value of a is", a)
    print("the number of rows with value of a", b)

so that I would get:

January 1002
February 4305
etc

Now, I can easily get a list of the unique values of 'activity_month', along with a count of how many rows have activity_month equal to a given value:

a="January"
len(df[df["activity_month"] == a])

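But I want to do this inside the loop, for the specific iteration of i, x. How can I get a handle on the subset of data inside 'x' at each iteration, so that I can see the value of 'activity_month' for that iteration and the number of rows with that value?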

1 Answer:

Answer 0: (score: 0)

Here is a short example DataFrame:

import pandas as pd

df = pd.DataFrame([['January',19],['March',6],['January',24],['November',83],['February',23],
                    ['November',4],['February',98],['January',44],['October',47],['January',4],
                    ['April',8],['March',21],['April',41],['June',34],['March',63]],
                    columns=['activity_month','activity_count'])

Yields:

   activity_month  activity_count
0         January              19
1           March               6
2         January              24
3        November              83
4        February              23
5        November               4
6        February              98
7         January              44
8         October              47
9         January               4
10          April               8
11          March              21
12          April              41
13           June              34
14          March              63

If you want the sum of the values for each group in df.groupby('activity_month'), you can do this:

df.groupby('activity_month')['activity_count'].sum()

Gives:

activity_month
April        49
February    121
January      91
June         34
March        90
November     87
October      47
Name: activity_count, dtype: int64

To get the number of rows corresponding to each group:

df.groupby('activity_month')['activity_count'].agg('count')

Gives:

activity_month
April       2
February    2
January     4
June        1
March       3
November    2
October     1
Name: activity_count, dtype: int64
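As a side note, here is a minimal sketch of two other ways to get the same per-month row counts, assuming the example DataFrame above. The variable names counts_by_size and counts_by_value are only illustrative. One difference worth knowing: count skips missing values in the selected column, whereas size counts every row in the group.

```python
import pandas as pd

# Same example data as above, rebuilt so this snippet runs on its own.
df = pd.DataFrame(
    [['January', 19], ['March', 6], ['January', 24], ['November', 83], ['February', 23],
     ['November', 4], ['February', 98], ['January', 44], ['October', 47], ['January', 4],
     ['April', 8], ['March', 21], ['April', 41], ['June', 34], ['March', 63]],
    columns=['activity_month', 'activity_count'])

# size() counts all rows per group, including rows where activity_count is NaN.
counts_by_size = df.groupby('activity_month').size()

# value_counts() gives the same numbers, ordered by frequency rather than by month name.
counts_by_value = df['activity_month'].value_counts()

print(counts_by_size)
print(counts_by_value)
```

Either result can be indexed by month, for example counts_by_size['January'].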

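After re-reading your question, I am convinced that you are not approaching this in the most efficient way. I strongly recommend that you do not explicitly loop over the axes created by df.hist(), especially when this information can be accessed quickly (and directly) from df itself.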
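If you do want a loop that hands you the month and its rows on each iteration, one option is to iterate over the groupby object instead of the axes. This is only a sketch, assuming the example DataFrame above; rows_per_month is a name I made up for illustration.

```python
import pandas as pd

# Same example data as above, rebuilt so this snippet runs on its own.
df = pd.DataFrame(
    [['January', 19], ['March', 6], ['January', 24], ['November', 83], ['February', 23],
     ['November', 4], ['February', 98], ['January', 44], ['October', 47], ['January', 4],
     ['April', 8], ['March', 21], ['April', 41], ['June', 34], ['March', 63]],
    columns=['activity_month', 'activity_count'])

rows_per_month = {}
# Each iteration yields the group key (the month) and the sub-DataFrame for that month.
# By default groupby sorts the keys, so months come out in alphabetical order.
for month, group in df.groupby('activity_month'):
    rows_per_month[month] = len(group)
    print(month, len(group))
```

Inside the loop, group is an ordinary DataFrame, so anything else you need for that month (for instance group['activity_count']) is available there too.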