I have a melted (long-format) DataFrame that looks like this:
date group metric n_events total_users
0 2017-01-01 control metric1 33.919910 827.416818
27 2017-01-01 variant1 metric1 55.141467 780.840083
54 2017-01-01 variant2 metric1 63.045587 436.381533
1 2017-01-02 control metric2 74.013340 145.551779
28 2017-01-02 variant1 metric2 78.539663 553.410827
I want to compute some uplift metrics on the melted DataFrame. So far I have been pivoting the DataFrame, which is not ideal.
import numpy as np
import pandas as pd

# 81 rows: 3 groups x 27 days, with one metric per day
df = pd.DataFrame(
    {'group': sorted(['control','variant1','variant2']*27),
     'metric': ['metric1', 'metric2', 'metric3']*27,
     # np.random.uniform expects (low, high)
     'n_events': np.random.uniform(20, 100, size=81),
     'total_users': np.random.uniform(20, 1000, size=81),
     'date' : list(pd.date_range('1/1/2017', periods=27, freq='D'))*3
    })
df = df.sort_values(['date','group','metric'])
# pivot so that every group becomes its own column
t = pd.pivot_table(df, values=['n_events','total_users'],
                   index=['date','metric'],
                   columns=['group'],
                   aggfunc='sum').reset_index()
for var in ['variant1','variant2']:
    uplift_colname = var + "_standard_uplift"
    # adding daily uplift: variant event rate minus control event rate
    t[uplift_colname] = (t['n_events'][var]/t['total_users'][var]) - \
                        (t['n_events']['control']/t['total_users']['control'])
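I am looking for a better way to get the uplift without pivoting the DataFrame, so that the data stays in the melted format. I tried using groupby or apply with a custom function, i.e.
df.groupby(['date','metric'])[['n_events','group','total_users']].apply(myfxn)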
Answer 0 (score: 2)
def proc(df):
    # total events and users per group within one (date, metric) block
    s = df.groupby('group').sum()
    # event rate per group
    r = s.n_events / s.total_users
    # uplift of each variant relative to control
    return r.drop('control').sub(r.loc['control'])

gcols = ['date', 'metric']
ocols = ['group', 'n_events', 'total_users']
suffix = '_standard_uplift'
df.groupby(gcols)[ocols].apply(proc).add_suffix(suffix)
This returns the same information that your current t provides:
group variant1_standard_uplift variant2_standard_uplift
date metric
2017-01-01 metric1 -0.175006 -0.334146
2017-01-02 metric2 0.213414 0.007030
2017-01-03 metric3 0.041405 0.913016
2017-01-04 metric1 -0.102361 -0.044124
2017-01-05 metric2 0.114260 0.031469
2017-01-06 metric3 0.316760 -0.113277
2017-01-07 metric1 3.049462 0.052456
2017-01-08 metric2 -0.050300 -0.015628
2017-01-09 metric3 0.004769 0.239641
2017-01-10 metric1 0.025574 0.153893
2017-01-11 metric2 0.111758 0.083404
2017-01-12 metric3 -0.175687 -0.107851
2017-01-13 metric1 0.147153 0.266303
2017-01-14 metric2 -0.162214 -0.238798
2017-01-15 metric3 0.137627 0.010475
2017-01-16 metric1 -0.223583 -0.208177
2017-01-17 metric2 0.154821 0.189663
2017-01-18 metric3 -0.161725 -0.536955
2017-01-19 metric1 -0.002525 0.027977
2017-01-20 metric2 -0.210697 0.564725
2017-01-21 metric3 -0.228038 -0.255461
2017-01-22 metric1 -0.210647 -0.141039
2017-01-23 metric2 0.354086 -0.366433
2017-01-24 metric3 0.344310 -0.045895
2017-01-25 metric1 0.340080 0.105040
2017-01-26 metric2 2.512369 -0.062200
2017-01-27 metric3 -1.326842 -1.819911
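To make the "same information" claim easy to check by hand, here is a minimal sketch that runs both routes on a small data set and compares the results. The `toy` frame and its numbers are made up for illustration and are not from the question; `by_groupby` and `by_pivot` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two dates, one metric per date, three groups.
toy = pd.DataFrame({
    'date': pd.to_datetime(['2017-01-01'] * 3 + ['2017-01-02'] * 3),
    'metric': ['metric1'] * 3 + ['metric2'] * 3,
    'group': ['control', 'variant1', 'variant2'] * 2,
    'n_events': [10.0, 30.0, 20.0, 5.0, 10.0, 30.0],
    'total_users': [100.0, 200.0, 100.0, 50.0, 50.0, 100.0],
})

def proc(df):
    # event rate per group, then each variant minus control
    s = df.groupby('group')[['n_events', 'total_users']].sum()
    r = s['n_events'] / s['total_users']
    return r.drop('control').sub(r.loc['control'])

gcols = ['date', 'metric']
ocols = ['group', 'n_events', 'total_users']
suffix = '_standard_uplift'

# Route 1: groupby + apply on the melted frame
by_groupby = toy.groupby(gcols)[ocols].apply(proc).add_suffix(suffix)

# Route 2: pivot to wide format first, as in the question
wide = toy.pivot_table(index=gcols, columns='group',
                       values=['n_events', 'total_users'], aggfunc='sum')
rate = wide['n_events'] / wide['total_users']
by_pivot = (rate.drop(columns='control')
                .sub(rate['control'], axis=0)
                .add_suffix(suffix))

# Both routes should produce the same uplift values
same = np.allclose(by_groupby.to_numpy(), by_pivot.to_numpy())
print(same)
```

On 2017-01-01 the control rate is 10/100 = 0.1 and the variant1 rate is 30/200 = 0.15, so the variant1 uplift works out to 0.05 on either route.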
To keep the same DataFrame as df, but with the two new columns appended...
def proc(df):
    s = df.groupby('group').sum()
    r = s.n_events / s.total_users
    return r.drop('control').sub(r.loc['control'])

gcols = ['date', 'metric']
ocols = ['group', 'n_events', 'total_users']
suffix = '_standard_uplift'
df.join(df.groupby(gcols)[ocols].apply(proc).add_suffix(suffix), on=gcols).sort_index()
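If a fully long-format result is preferred (one uplift value per row rather than one column per variant), a different technique from the answer above is to merge the control rate back onto each row. This is only a sketch under assumptions: the `toy` data, the `rate`, `control_rate`, and `standard_uplift` column names, and the `long_uplift` variable are all made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: two dates, one metric per date, three groups.
toy = pd.DataFrame({
    'date': pd.to_datetime(['2017-01-01'] * 3 + ['2017-01-02'] * 3),
    'metric': ['metric1'] * 3 + ['metric2'] * 3,
    'group': ['control', 'variant1', 'variant2'] * 2,
    'n_events': [10.0, 30.0, 20.0, 5.0, 10.0, 30.0],
    'total_users': [100.0, 200.0, 100.0, 50.0, 50.0, 100.0],
})

gcols = ['date', 'metric']

# One row per (date, metric, group), then the event rate for that row
agg = toy.groupby(gcols + ['group'], as_index=False)[['n_events', 'total_users']].sum()
agg['rate'] = agg['n_events'] / agg['total_users']

# Control rate per (date, metric), renamed so it can sit next to 'rate'
control = (agg.loc[agg['group'] == 'control', gcols + ['rate']]
              .rename(columns={'rate': 'control_rate'}))

# Attach the control rate to every row of the same (date, metric) block
long_uplift = agg.merge(control, on=gcols, how='left')
long_uplift['standard_uplift'] = long_uplift['rate'] - long_uplift['control_rate']

print(long_uplift[gcols + ['group', 'standard_uplift']])
```

Control rows get an uplift of 0 by construction; filter them out with `long_uplift[long_uplift['group'] != 'control']` if they are not wanted.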