Pandas custom function applied to a melted dataframe

Time: 2017-04-11 20:25:23

Tags: pandas apply

I have a melted dataframe that looks like this:

   date         group    metric   n_events    total_users
0  2017-01-01   control  metric1  33.919910   827.416818
27 2017-01-01  variant1  metric1  55.141467   780.840083
54 2017-01-01  variant2  metric1  63.045587   436.381533
1  2017-01-02   control  metric2  74.013340   145.551779
28 2017-01-02  variant1  metric2  78.539663   553.410827

I want to compute some uplift metrics on this melted dataframe. So far I have been pivoting the dataframe to do it, which isn't ideal.

import pandas as pd
import numpy as np

df = pd.DataFrame(
    {'group': sorted(['control', 'variant1', 'variant2'] * 27),
     'metric': ['metric1', 'metric2', 'metric3'] * 27,
     'n_events': np.random.uniform(20, 100, size=81),
     'total_users': np.random.uniform(20, 1000, size=81),
     'date': list(pd.date_range('1/1/2017', periods=27, freq='D')) * 3
     })

df = df.sort_values(['date','group','metric'])

t = pd.pivot_table(df, values=['n_events', 'total_users'],
                   index=['date', 'metric'],
                   columns=['group'],
                   aggfunc=np.sum).reset_index()

for var in ['variant1', 'variant2']:
    uplift_colname = var + "_standard_uplift"
    # adding daily uplift
    t[uplift_colname] = (t['n_events'][var] / t['total_users'][var]) - \
                        (t['n_events']['control'] / t['total_users']['control'])

I'm looking for a better way to get the uplift without having to pivot the dataframe, i.e. keeping the melted data format. I've tried using groupby + apply with a custom function, i.e.

df.groupby(['date', 'metric'])[['n_events', 'group', 'total_users']].apply(myfxn)

1 Answer:

Answer 0 (score: 2):

def proc(df):
    # within one (date, metric) slice: sum events/users per test group,
    # compute each group's rate, then subtract the control group's rate
    s = df.groupby('group').sum()
    r = s.n_events / s.total_users
    return r.drop('control').sub(r.loc['control'])

gcols = ['date', 'metric']
ocols = ['group', 'n_events', 'total_users']
suffix = '_standard_uplift'
df.groupby(gcols)[ocols].apply(proc).add_suffix(suffix)

This gets you the same information that your current t contains:

group               variant1_standard_uplift  variant2_standard_uplift
date       metric                                                     
2017-01-01 metric1                 -0.175006                 -0.334146
2017-01-02 metric2                  0.213414                  0.007030
2017-01-03 metric3                  0.041405                  0.913016
2017-01-04 metric1                 -0.102361                 -0.044124
2017-01-05 metric2                  0.114260                  0.031469
2017-01-06 metric3                  0.316760                 -0.113277
2017-01-07 metric1                  3.049462                  0.052456
2017-01-08 metric2                 -0.050300                 -0.015628
2017-01-09 metric3                  0.004769                  0.239641
2017-01-10 metric1                  0.025574                  0.153893
2017-01-11 metric2                  0.111758                  0.083404
2017-01-12 metric3                 -0.175687                 -0.107851
2017-01-13 metric1                  0.147153                  0.266303
2017-01-14 metric2                 -0.162214                 -0.238798
2017-01-15 metric3                  0.137627                  0.010475
2017-01-16 metric1                 -0.223583                 -0.208177
2017-01-17 metric2                  0.154821                  0.189663
2017-01-18 metric3                 -0.161725                 -0.536955
2017-01-19 metric1                 -0.002525                  0.027977
2017-01-20 metric2                 -0.210697                  0.564725
2017-01-21 metric3                 -0.228038                 -0.255461
2017-01-22 metric1                 -0.210647                 -0.141039
2017-01-23 metric2                  0.354086                 -0.366433
2017-01-24 metric3                  0.344310                 -0.045895
2017-01-25 metric1                  0.340080                  0.105040
2017-01-26 metric2                  2.512369                 -0.062200
2017-01-27 metric3                 -1.326842                 -1.819911
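
As a quick sanity check (just a sketch, assuming the pivot-based t from the question is still in scope and was built from the same df), the two approaches should produce identical numbers:

import numpy as np

uplift = df.groupby(gcols)[ocols].apply(proc).add_suffix(suffix)

# t was pivoted with index=['date', 'metric'] and then reset, so its rows are
# in the same sorted (date, metric) order as the groupby result above
for var in ['variant1', 'variant2']:
    col = var + suffix
    assert np.allclose(np.asarray(t[col]).ravel(), uplift[col].values)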

To keep the same dataframe as df, but with the two new columns appended...

def proc(df):
    s = df.groupby('group').sum()
    r = s.n_events / s.total_users
    return r.drop('control').sub(r.loc['control'])

gcols = ['date', 'metric']
ocols = ['group', 'n_events', 'total_users']
suffix = '_standard_uplift'
df.join(df.groupby(gcols)[ocols].apply(proc).add_suffix(suffix), on=gcols).sort_index()
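
If it helps to see what the joined result looks like (a minimal sketch; the exact numbers depend on the random data above), you can peek at the relevant columns:

out = df.join(df.groupby(gcols)[ocols].apply(proc).add_suffix(suffix), on=gcols).sort_index()

# every original row keeps its values and picks up the uplift of its
# (date, metric) cell, so the control/variant1/variant2 rows for the same day
# and metric share the same two uplift values
print(out[['date', 'group', 'metric',
           'variant1_standard_uplift', 'variant2_standard_uplift']].head(6))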
