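Another aggregation question, but this time the totals span several fields: one set is based on a criterion, the other on the overall total. My df: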
display_name security_type1 currency_str state rfq_qty_CAD_Equiv
A GOVT USD Done 100,000
B CORP NZD Passed 100,000
B CORP USD Done 100,000
C CORP EUR Done 100,000
C CORP EUR Traded Away 100,000
C CORP GBP Done 100,000
C CORP GBP Done 100,000
C CORP USD Done 100,000
The desired output is as follows:
display_name security_type1 currency_str Done_RFQ Done_RFQ_Volume Total_RFQ Total RFQ_Volume Done_Pct
A            GOVT           USD          1        100,000         1         100,000          100%
B            CORP           USD          1        100,000         2         200,000          50%
C            CORP           EUR          1        100,000         5         500,000          20%
C            CORP           GBP          2        200,000         5         500,000          40%
C            CORP           USD          1        100,000         5         500,000          20%
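That is:
1. Group on display_name, security_type1 and currency_str.
2. Done_RFQ is the count of rows where state contains the string Done.
3. Done_RFQ_Volume is the sum of rfq_qty_CAD_Equiv over the rows where state contains the string Done, i.e. where point 2 is true.
4. Total_RFQ is the count of all unique display_name, security_type1 and currency_str combinations, regardless of Done.
5. Total RFQ_Volume is the sum of rfq_qty_CAD_Equiv.
6. Done_Pct = Done_RFQ / Total_RFQ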
My attempt so far:
d = [
('Done_RFQ', lambda x: x.str.contains('Done').sum()),
('Done_RFQ_Volume', 'sum'),
('Total_RFQ', 'size'),
('Total RFQ_Volume', 'sum')
]
df_Done_Client_Hit_Rate_Volume = df.groupby(['cust_cdr_display_name','rbc_security_type1','currency_str']).agg(d).reset_index()
df_Done_Client_Hit_Rate_Volume['Hit Rate'] = df_Done_Client_Hit_Rate_Volume['Done_RFQ'] / df_Done_Client_Hit_Rate_Volume['Total_RFQ']
display(df_Done_Client_Hit_Rate_Volume)
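When computing the sum of rfq_qty_CAD_Equiv, I am not sure how to reference only the "Done" rows as opposed to the rows that are not "Done". The two volume columns (listed in d =) depend on the result of that criterion. Any help would be appreciated.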
Answer 0 (score: 1)
Use:
#convert column to numeric if necessary
df['rfq_qty_CAD_Equiv'] = df['rfq_qty_CAD_Equiv'].str.replace(',','').astype(int)
d = [
('Done_RFQ_Volume', 'sum'),
('Done_RFQ', 'size'),
]
#first filter by substring and then aggregate of filtered df
mask = df['state'].str.contains('Done')
df1 = (df[mask].groupby(['display_name','security_type1','currency_str'])['rfq_qty_CAD_Equiv']
.agg(d)
.reset_index())
print (df1)
display_name security_type1 currency_str Done_RFQ_Volume Done_RFQ
0 A GOVT USD 100000 1
1 B CORP USD 100000 1
2 C CORP EUR 100000 1
3 C CORP GBP 200000 2
4 C CORP USD 100000 1
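The filter-then-aggregate step above can also be written with keyword-style named aggregation, which is available in recent pandas versions. The following is only a sketch: the sample frame is rebuilt by hand from the question's data with the volumes already numeric, the variable names `keys` and `done` are illustrative, and `na=False` is an added assumption so that a missing `state` counts as not Done.

```python
import pandas as pd

# Sample data copied from the question (volumes already numeric here).
df = pd.DataFrame({
    "display_name":   ["A", "B", "B", "C", "C", "C", "C", "C"],
    "security_type1": ["GOVT", "CORP", "CORP", "CORP", "CORP", "CORP", "CORP", "CORP"],
    "currency_str":   ["USD", "NZD", "USD", "EUR", "EUR", "GBP", "GBP", "USD"],
    "state":          ["Done", "Passed", "Done", "Done", "Traded Away",
                       "Done", "Done", "Done"],
    "rfq_qty_CAD_Equiv": [100_000] * 8,
})

keys = ["display_name", "security_type1", "currency_str"]

# na=False treats a missing state as "not Done" instead of yielding NaN in the mask.
mask = df["state"].str.contains("Done", na=False)

# Keep only the Done rows, then count them and sum their volume per group.
done = (
    df[mask]
    .groupby(keys)["rfq_qty_CAD_Equiv"]
    .agg(Done_RFQ="size", Done_RFQ_Volume="sum")
    .reset_index()
)
print(done)
```

Because the rows are filtered before grouping, a group with no Done rows (here B / CORP / NZD) does not appear in the result at all, which matches the desired output in the question.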
d = [
('Total RFQ_Volume', 'sum'),
('Total_RFQ', 'size'),
]
#aggregate by column display_name only
df2 = df.groupby(['display_name'])['rfq_qty_CAD_Equiv'].agg(d)
print (df2)
Total RFQ_Volume Total_RFQ
display_name
A 100000 1
B 200000 2
C 500000 5
#join both df together
df_Done_Client_Hit_Rate_Volume = df1.join(df2, on='display_name')
df_Done_Client_Hit_Rate_Volume['Hit Rate'] = (df_Done_Client_Hit_Rate_Volume['Done_RFQ'] /
                                              df_Done_Client_Hit_Rate_Volume['Total_RFQ'])
print (df_Done_Client_Hit_Rate_Volume)
display_name security_type1 currency_str Done_RFQ_Volume Done_RFQ \
0 A GOVT USD 100000 1
1 B CORP USD 100000 1
2 C CORP EUR 100000 1
3 C CORP GBP 200000 2
4 C CORP USD 100000 1
Total RFQ_Volume Total_RFQ Hit Rate
0 100000 1 1.0
1 200000 2 0.5
2 500000 5 0.2
3 500000 5 0.4
4 500000 5 0.2
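For comparison, here is one possible variant that avoids the separate join by building helper columns first and mapping the per-client totals back onto the grouped result. It is a sketch under assumptions rather than the method used above: the sample frame is reconstructed from the question, the names `work`, `totals` and `out` are made up for illustration, and the percentage formatting of `Done_Pct` is chosen to mirror the desired output.

```python
import pandas as pd

# Sample data copied from the question, volumes as strings with thousands separators.
df = pd.DataFrame({
    "display_name":   ["A", "B", "B", "C", "C", "C", "C", "C"],
    "security_type1": ["GOVT", "CORP", "CORP", "CORP", "CORP", "CORP", "CORP", "CORP"],
    "currency_str":   ["USD", "NZD", "USD", "EUR", "EUR", "GBP", "GBP", "USD"],
    "state":          ["Done", "Passed", "Done", "Done", "Traded Away",
                       "Done", "Done", "Done"],
    "rfq_qty_CAD_Equiv": ["100,000"] * 8,
})

# Strip the thousands separators so the volumes can be summed.
df["rfq_qty_CAD_Equiv"] = (
    df["rfq_qty_CAD_Equiv"].str.replace(",", "", regex=False).astype(int)
)

keys = ["display_name", "security_type1", "currency_str"]
is_done = df["state"].str.contains("Done", na=False)

# Helper columns: 1 per Done row, and the volume only where the row is Done.
work = df.assign(
    Done_RFQ=is_done.astype(int),
    Done_RFQ_Volume=df["rfq_qty_CAD_Equiv"].where(is_done, 0),
)
out = work.groupby(keys, as_index=False)[["Done_RFQ", "Done_RFQ_Volume"]].sum()

# Per-client totals over all rows, mapped back by display_name.
totals = df.groupby("display_name")["rfq_qty_CAD_Equiv"].agg(["size", "sum"])
out["Total_RFQ"] = out["display_name"].map(totals["size"])
out["Total RFQ_Volume"] = out["display_name"].map(totals["sum"])

# Drop groups without any Done row so the result matches the desired output.
out = out[out["Done_RFQ"] > 0].reset_index(drop=True)
out["Done_Pct"] = (out["Done_RFQ"] / out["Total_RFQ"]).map("{:.0%}".format)
print(out)
```

The trade-off is that this version keeps zero-Done groups until the explicit filter near the end, so removing that one line would report them with a count of 0 instead of dropping them.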