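Pandas Groupby: aggregations on the same column, but multiple totals based on multiple criteria

Time: 2018-07-10 05:09:41

Tags: python pandas dataframe pandas-groupby

A question about aggregation, but this time the totals span several fields: one set is based on a criterion, the second set is overall totals. My df: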

display_name    security_type1  currency_str    state   rfq_qty_CAD_Equiv
A                     GOVT           USD        Done        100,000
B                     CORP           NZD        Passed      100,000
B                     CORP           USD        Done        100,000
C                     CORP           EUR        Done        100,000
C                     CORP           EUR        Traded Away 100,000
C                     CORP           GBP        Done        100,000
C                     CORP           GBP        Done        100,000
C                     CORP           USD        Done        100,000

The desired output is as follows:

display_name  security_type1  currency_str  Done_RFQ  Done_RFQ_Volume  Total_RFQ  Total RFQ_Volume  Done_Pct
A             GOVT            USD           1         100,000          1          100,000           100%
B             CORP            USD           1         100,000          2          200,000           50%
C             CORP            EUR           1         100,000          5          500,000           20%
C             CORP            GBP           2         200,000          5          500,000           40%
C             CORP            USD           1         100,000          5          500,000           20%

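That is:

  1. Group on display_name, security_type1, currency_str.
  2. Done_RFQ is the number of rows where state contains the string Done.
  3. Done_RFQ_Volume is the sum of rfq_qty_CAD_Equiv over the rows where state contains the string Done, i.e. where point 2 is true.
  4. Total_RFQ is the count of all unique display_name, security_type1, currency_str combinations, regardless of whether state contains Done.
  5. Total RFQ_Volume is the sum of rfq_qty_CAD_Equiv for all records where point 4 is true.
  6. Finally, show Done as a percentage of the total, i.e. Done_Pct = Done_RFQ / Total_RFQ.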

My attempt at implementing this:

d = [
     ('Done_RFQ', lambda x: x.str.contains('Done').sum()),
     ('Done_RFQ_Volume', 'sum'),
     ('Total_RFQ', 'size'),
     ('Total RFQ_Volume', 'sum')
    ]
df_Done_Client_Hit_Rate_Volume = df.groupby(['cust_cdr_display_name','rbc_security_type1','currency_str']).agg(d).reset_index()
df_Done_Client_Hit_Rate_Volume['Hit Rate'] = df_Done_Client_Hit_Rate_Volume['Done_RFQ'] / df_Done_Client_Hit_Rate_Volume['Total_RFQ'] 
display(df_Done_Client_Hit_Rate_Volume)

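When summing rfq_qty_CAD_Equiv, I am not sure how to reference only the "Done" rows as opposed to all rows. The two volume columns (listed in d =) each depend on the result of a criterion. Any help would be appreciated.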

1 Answer:

Answer 0: (score: 1)

Use:

#convert column to numeric if necessary
df['rfq_qty_CAD_Equiv'] = df['rfq_qty_CAD_Equiv'].str.replace(',','').astype(int)

d = [
     ('Done_RFQ_Volume', 'sum'), 
     ('Done_RFQ', 'size'), 
    ]

#first filter by substring and then aggregate of filtered df
mask = df['state'].str.contains('Done')
df1 = (df[mask].groupby(['display_name','security_type1','currency_str'])['rfq_qty_CAD_Equiv']
               .agg(d)
               .reset_index())

print (df1)
  display_name security_type1 currency_str  Done_RFQ_Volume  Done_RFQ
0            A           GOVT          USD           100000         1
1            B           CORP          USD           100000         1
2            C           CORP          EUR           100000         1
3            C           CORP          GBP           200000         2
4            C           CORP          USD           100000         1

d = [
     ('Total RFQ_Volume', 'sum'), 
     ('Total_RFQ', 'size'), 
    ]

#aggregate by column display_name only
df2 = df.groupby(['display_name'])['rfq_qty_CAD_Equiv'].agg(d)
print (df2)
              Total RFQ_Volume  Total_RFQ
display_name                             
A                       100000          1
B                       200000          2
C                       500000          5
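
As an alternative to building a separate totals frame, the per-client totals could be broadcast onto every row with transform before filtering, which avoids the later join. This is only a minimal sketch and not part of the original answer: it assumes the sample data from the question with rfq_qty_CAD_Equiv already numeric, pandas 0.25 or later for named aggregation, and the names by_client, keys, done, and out are illustrative.

```python
import pandas as pd

# Sample data from the question, with the volume column already numeric
df = pd.DataFrame({
    'display_name':   ['A', 'B', 'B', 'C', 'C', 'C', 'C', 'C'],
    'security_type1': ['GOVT', 'CORP', 'CORP', 'CORP', 'CORP', 'CORP', 'CORP', 'CORP'],
    'currency_str':   ['USD', 'NZD', 'USD', 'EUR', 'EUR', 'GBP', 'GBP', 'USD'],
    'state':          ['Done', 'Passed', 'Done', 'Done', 'Traded Away',
                       'Done', 'Done', 'Done'],
    'rfq_qty_CAD_Equiv': [100000] * 8,
})

# Broadcast per-client totals to every row (same length as df)
by_client = df.groupby('display_name')['rfq_qty_CAD_Equiv']
df['Total RFQ_Volume'] = by_client.transform('sum')
# 'count' counts non-missing values; here no volume is missing
df['Total_RFQ'] = by_client.transform('count')

# Keep only the Done rows, then aggregate per group
keys = ['display_name', 'security_type1', 'currency_str']
done = df[df['state'].str.contains('Done')]
out = (done.groupby(keys)
           .agg(Done_RFQ=('rfq_qty_CAD_Equiv', 'size'),
                Done_RFQ_Volume=('rfq_qty_CAD_Equiv', 'sum'),
                Total_RFQ=('Total_RFQ', 'first'),
                # dict unpacking because the column name contains a space
                **{'Total RFQ_Volume': ('Total RFQ_Volume', 'first')})
           .reset_index())
out['Hit Rate'] = out['Done_RFQ'] / out['Total_RFQ']
print(out)
```

Because the totals are constant within each client, taking 'first' inside each group simply carries them through the aggregation.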

#join both df together
df_Done_Client_Hit_Rate_Volume = df1.join(df2, on='display_name')
df_Done_Client_Hit_Rate_Volume['Hit Rate'] = (df_Done_Client_Hit_Rate_Volume['Done_RFQ'] /
                                              df_Done_Client_Hit_Rate_Volume['Total_RFQ'])

print (df_Done_Client_Hit_Rate_Volume)
  display_name security_type1 currency_str  Done_RFQ_Volume  Done_RFQ  \
0            A           GOVT          USD           100000         1   
1            B           CORP          USD           100000         1   
2            C           CORP          EUR           100000         1   
3            C           CORP          GBP           200000         2   
4            C           CORP          USD           100000         1   

   Total RFQ_Volume  Total_RFQ  Hit Rate  
0            100000          1       1.0  
1            200000          2       0.5  
2            500000          5       0.2  
3            500000          5       0.4  
4            500000          5       0.2
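
The question asked for Done_Pct as a percentage (100%, 50%, ...), while Hit Rate above is a fraction. One possible way to derive the display column is sketched below. The frame result is a hypothetical stand-in that copies only the Done_RFQ and Total_RFQ values printed above, and Done_Pct is the column name taken from the question's desired output.

```python
import pandas as pd

# Hypothetical stand-in holding the two count columns from the output above
result = pd.DataFrame({
    'Done_RFQ':  [1, 1, 1, 2, 1],
    'Total_RFQ': [1, 2, 5, 5, 5],
})

# Fraction of requests that were Done
result['Hit Rate'] = result['Done_RFQ'] / result['Total_RFQ']

# String column for display only; keep 'Hit Rate' for further arithmetic
result['Done_Pct'] = result['Hit Rate'].map('{:.0%}'.format)
print(result)
```

Keeping the numeric Hit Rate alongside the formatted string means the percentage can still be sorted or averaged later.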