Pandas Groupby: aggregate on the same column, but based on two different criteria / dataframes

Asked: 2018-07-09 08:52:00

Tags: python pandas dataframe pandas-groupby

My dataframe:

display_name    security_type1  currency_str     state
         A            GOVT           USD         Done
         B            CORP           NZD         Passed
         B            CORP           USD         Done
         C            CORP           EUR         Done
         C            CORP           EUR         Traded Away
         C            CORP           GBP         Done
         C            CORP           GBP         Done
         C            CORP           USD         Done
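
For anyone who wants to reproduce this, the table above can be rebuilt roughly as follows. This is only a sketch: the values are copied from the table, and the variable name df is the one used in the code further down.

```python
import pandas as pd

# Rows copied from the table above, one list per column.
df = pd.DataFrame({
    'display_name':   ['A', 'B', 'B', 'C', 'C', 'C', 'C', 'C'],
    'security_type1': ['GOVT', 'CORP', 'CORP', 'CORP', 'CORP', 'CORP', 'CORP', 'CORP'],
    'currency_str':   ['USD', 'NZD', 'USD', 'EUR', 'EUR', 'GBP', 'GBP', 'USD'],
    'state':          ['Done', 'Passed', 'Done', 'Done', 'Traded Away',
                       'Done', 'Done', 'Done'],
})

print(df)
```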

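My desired result is:

a. Group by display_name, security_type1 and currency_str.

b. Then count the rows where column state contains Done and store the count in column Done_RFQ.

c. Show the total number of rows for each display_name, security_type1, currency_str combination and store it in column Total_RFQ.

d. Finally, show Done as a percentage of the total, i.e. Done_Pct = Done_RFQ / Total_RFQ.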

display_name    security_type1  currency_str   Done_RFQ Total_RFQ Done_Pct
A               GOVT             USD           1           1      100%
B               CORP             USD           1           2      50%
C               CORP             EUR           1           5      20%
C               CORP             GBP           2           5      40%
C               CORP             USD           1           5      20%

My code works except for Total_RFQ and Done_Pct:

d = [('Done_RFQ', 'size')]
df_Done_Client = df[
                    df['state'].str.contains('Done')
                ][['display_name','security_type1','currency_str','state']].copy()

df_Done_Client = df_Done_Client.groupby(['display_name','security_type1','currency_str'])['state'].agg(d).reset_index()
# Sum of all Done RFQs per display_name
Sum_of_Done_For_Month = df_Done_Client.groupby('display_name')['Done_RFQ'].transform('sum')
df_Done_Client['Total_Done_RFQ'] = Sum_of_Done_For_Month
df_Done_Client['Done_Pct'] = df_Done_Client['Done_RFQ'].div(Sum_of_Done_For_Month).round(5)
display(df_Done_Client)

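I am not sure how to calculate that total, because it needs to come from another dataframe, i.e. the same fields but without the "Done" condition.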

df_All_Client = df[['display_name','security_type1','currency_str','state']].copy()

2 answers:

Answer 0 (score: 1)

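I think you need size for the Total_RFQ column (the total row count) and, for Done_RFQ, the count of a boolean mask: compare state with Done and sum the True values.

If you need to check for a substring rather than an exact match, use x.str.contains('Done') in place of x.eq('Done'):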

d = [('Total_RFQ', 'size'), ('Done_RFQ', lambda x: x.eq('Done').sum())]
df=df.groupby(['display_name','security_type1','currency_str'])['state'].agg(d).reset_index()
df['Done_Pct'] = df['Done_RFQ'] / df['Total_RFQ'] * 100
print (df)
  display_name security_type1 currency_str  Total_RFQ  Done_RFQ  Done_Pct
0            A           GOVT          USD          1         1     100.0
1            B           CORP          NZD          1         0       0.0
2            B           CORP          USD          1         1     100.0
3            C           CORP          EUR          2         1      50.0
4            C           CORP          GBP          2         2     100.0
5            C           CORP          USD          1         1     100.0
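
For the substring case, something along these lines might work. It is a sketch rather than the only way to do it: it assumes pandas 0.25 or later (for keyword-style named aggregation), rebuilds the sample data from the question so it runs on its own, and the names group_cols and out are just illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'display_name':   ['A', 'B', 'B', 'C', 'C', 'C', 'C', 'C'],
    'security_type1': ['GOVT', 'CORP', 'CORP', 'CORP', 'CORP', 'CORP', 'CORP', 'CORP'],
    'currency_str':   ['USD', 'NZD', 'USD', 'EUR', 'EUR', 'GBP', 'GBP', 'USD'],
    'state':          ['Done', 'Passed', 'Done', 'Done', 'Traded Away',
                       'Done', 'Done', 'Done'],
})

group_cols = ['display_name', 'security_type1', 'currency_str']

out = (
    df.groupby(group_cols)['state']
      .agg(
          # total rows in each group
          Total_RFQ='size',
          # rows whose state contains the literal substring 'Done';
          # na=False treats missing states as "not Done"
          Done_RFQ=lambda s: s.str.contains('Done', na=False, regex=False).sum(),
      )
      .reset_index()
)
out['Done_Pct'] = out['Done_RFQ'] / out['Total_RFQ'] * 100

print(out)
```

On this sample the substring check and the exact comparison give the same counts, because no other state value contains Done.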

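Answer 1 (score: 1)

Here is one way. It is similar to @jezrael's solution, but keeps your logic of checking for the substring Done and filters for Done_RFQ > 0.

In addition, I think you need 2 groupby calculations to get your desired result, since Total_RFQ is calculated per display_name.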

# function to calculate Done_RFQ
# (a list of (name, function) pairs; newer pandas versions reject a dict of renamers here)
d = [('Done_RFQ', lambda x: x.str.contains('Done', na=False, regex=False).sum())]

# apply 2 groupby calculations
df['Total_RFQ'] = df.groupby('display_name')['display_name'].transform('size')

group_cols = ['display_name', 'security_type1', 'currency_str', 'Total_RFQ']
res = df.groupby(group_cols)['state'].agg(d).reset_index()

# calculate Done_Pct
res['Done_Pct'] = res['Done_RFQ'] / res['Total_RFQ']

# filter for Done_RFQ > 0
res = res[res['Done_RFQ'] > 0]

print(res)

  display_name security_type1 currency_str  Total_RFQ  Done_RFQ  Done_Pct
0            A           GOVT          USD          1         1       1.0
2            B           CORP          USD          2         1       0.5
3            C           CORP          EUR          5         1       0.2
4            C           CORP          GBP          5         2       0.4
5            C           CORP          USD          5         1       0.2
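
If you would rather not add a helper column to the original df, and you want the percentages printed as in the question's expected output, a variation like the following might do. Treat it as a sketch under assumptions: the sample data is rebuilt from the question, and the names is_done, totals and Done_Pct_str are made up for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'display_name':   ['A', 'B', 'B', 'C', 'C', 'C', 'C', 'C'],
    'security_type1': ['GOVT', 'CORP', 'CORP', 'CORP', 'CORP', 'CORP', 'CORP', 'CORP'],
    'currency_str':   ['USD', 'NZD', 'USD', 'EUR', 'EUR', 'GBP', 'GBP', 'USD'],
    'state':          ['Done', 'Passed', 'Done', 'Done', 'Traded Away',
                       'Done', 'Done', 'Done'],
})

group_cols = ['display_name', 'security_type1', 'currency_str']

# First groupby: Done count per (display_name, security_type1, currency_str).
is_done = df['state'].str.contains('Done', na=False, regex=False)
res = (
    df.assign(is_done=is_done)      # assign returns a copy, df itself is untouched
      .groupby(group_cols)['is_done']
      .sum()
      .rename('Done_RFQ')
      .reset_index()
)

# Second groupby: total rows per display_name, looked up by name.
totals = df.groupby('display_name').size()
res['Total_RFQ'] = res['display_name'].map(totals)

res['Done_Pct'] = res['Done_RFQ'] / res['Total_RFQ']

# Keep only combinations with at least one Done row.
res = res[res['Done_RFQ'] > 0].copy()

# Optional: percentage strings such as '50%' for display.
res['Done_Pct_str'] = (res['Done_Pct'] * 100).round().astype(int).astype(str) + '%'

print(res)
```

The .copy() after filtering is there so that adding Done_Pct_str does not trigger a SettingWithCopyWarning.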