pandas: Using groupby to sum column values

Date: 2016-05-03 07:18:53

Tags: python pandas group-by

I have the following DataFrame:

import pandas as pd

url = 'https://raw.githubusercontent.com/108michael/ms_thesis/master/mpl.Bspons.merge.1'
df = pd.read_csv(url, index_col=0)
# Parse the date strings, then use them as the index
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
df = df.set_index(['date'])

df.head(3)

    state   year    unemployment    log_diff_unemployment   id.thomas   party   type    bills   id.fec  years_exp   session     name                          disposition   catcode     naics
date                                                            
2006-05-01  AK  2006    6.6     -0.044452   1440    Republican  sen     s2686-109   S2AK00010   39  109     National Cable & Telecommunications Association     support     C4500   81
2006-05-01  AK  2006    6.6     -0.044452   1440    Republican  sen     s2686-109   S2AK00010   39  109     National Cable & Telecommunications Association     support     C4500   517
2007-03-27  AK  2007    6.3     -0.046520   1440    Republican  sen     s1000-110   S2AK00010   40  110     National Treasury Employees Union   support     L1100   NaN

I want to count the number of bills in each group defined by catcode > disposition > id.fec. I am using the following code:

df['billsum'] = df.groupby([pd.Grouper(level='date', freq='A'), 'catcode', \
        'disposition', 'id.fec']).bills.transform('sum')

which returns:

df.head(3)

    state   year    unemployment    log_diff_unemployment   id.thomas   party   type    bills   id.fec  years_exp   session     name                    disposition     catcode     naics   billsum
date                                                                
2006-05-01  AK  2006    6.6     -0.044452   1440    Republican  sen     s2686-109   S2AK00010   39  109     National Cable & Telecommunications Association     support     C4500   81  s2686-109s2686-109
2006-05-01  AK  2006    6.6     -0.044452   1440    Republican  sen     s2686-109   S2AK00010   39  109     National Cable & Telecommunications Association     support     C4500   517     s2686-109s2686-109
2007-03-27  AK  2007    6.3     -0.046520   1440    Republican  sen     s1000-110   S2AK00010   40  110     National Treasury Employees Union   support     L1100   NaN     s1000-110

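Instead of returning the number of bills in each group, the code returns all of the bills in each group joined together. I only want the count of bills in each group. Does anyone know how to make this work?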

1 answer:

Answer 0 (score: 1)

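I think you need transform with size rather than sum. Because bills holds strings, sum concatenates them, while size returns the number of rows in each group: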

df['billsum'] = df.groupby([pd.Grouper(level='date', freq='A'), 'catcode', \
        'disposition', 'id.fec']).bills.transform('size')

print(df.head(3))
           state    year  unemployment  log_diff_unemployment  id.thomas  \
date                                                                       
2006-05-01    AK  2006.0           6.6              -0.044452       1440   
2006-05-01    AK  2006.0           6.6              -0.044452       1440   
2007-03-27    AK  2007.0           6.3              -0.046520       1440   

                 party type      bills     id.fec  years_exp  session  \
date                                                                    
2006-05-01  Republican  sen  s2686-109  S2AK00010         39      109   
2006-05-01  Republican  sen  s2686-109  S2AK00010         39      109   
2007-03-27  Republican  sen  s1000-110  S2AK00010         40      110   

                                                       name disposition  \
date                                                                      
2006-05-01  National Cable & Telecommunications Association     support   
2006-05-01  National Cable & Telecommunications Association     support   
2007-03-27                National Treasury Employees Union     support   

           catcode naics  billsum  
date                               
2006-05-01   C4500    81        2  
2006-05-01   C4500   517        2  
2007-03-27   L1100   NaN        1
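
To try this without downloading the CSV, here is a minimal sketch on a hand-built frame. The three rows copy only the grouping columns and bills from the output above, and the column name distinct_bills is my own, not part of the question. I also group on df.index.year instead of pd.Grouper(level='date', freq='A'), an assumption-level swap that gives the same yearly buckets here and avoids the 'A' alias, which recent pandas versions deprecate. Note that size counts rows, so the bill repeated across two naics rows is counted twice; if you want distinct bills per group, nunique may be closer to what you need.

```python
import pandas as pd

# Hypothetical miniature of the question's frame: only the grouping
# columns and `bills`, with values copied from its first three rows.
df = pd.DataFrame(
    {
        "catcode": ["C4500", "C4500", "L1100"],
        "disposition": ["support", "support", "support"],
        "id.fec": ["S2AK00010", "S2AK00010", "S2AK00010"],
        "bills": ["s2686-109", "s2686-109", "s1000-110"],
    },
    index=pd.to_datetime(["2006-05-01", "2006-05-01", "2007-03-27"]).rename("date"),
)

# Group by calendar year of the index plus the three key columns.
# (Swapped in for pd.Grouper(level='date', freq='A'); an assumption of this sketch.)
keys = [df.index.year, "catcode", "disposition", "id.fec"]
grouped = df.groupby(keys)["bills"]

# 'size' broadcasts the number of rows in each group back onto every row.
row_counts = grouped.transform("size")
# 'nunique' broadcasts the number of distinct bill ids in each group instead.
distinct_counts = grouped.transform("nunique")

df["billsum"] = row_counts
df["distinct_bills"] = distinct_counts  # hypothetical column name

print(df[["bills", "billsum", "distinct_bills"]])
```

With these three rows, billsum comes out as 2, 2, 1 (matching the output above) and distinct_bills as 1, 1, 1.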