I have a df like this:
country product date_install date_purchase id date_authentication
DK NaN 2020-12-28 NaN 12343323 NaN
GB NaN 2021-01-10 NaN 12752971 NaN
UA monthly 2020-11-05 2021-01-15 10766369 2021-01-15
PL NaN 2021-01-24 NaN 13314244 NaN
MX NaN 2020-12-11 NaN 11856945 NaN
GB NaN 2020-12-02 NaN 11607569 NaN
IT yearly 2021-02-07 2021-01-15 13919183 NaN
KZ NaN 2020-12-04 NaN 11655951 2021-01-12
UA NaN 2020-11-27 NaN 11436990 NaN
US NaN 2021-01-08 NaN 12682751 2021-01-12
I am trying to get a df like this:
country product date_install installs purchases registrations ratio
US daily 2021-02-05 100 20 30 0.2
US monthly 2021-02-05 100 50 40 0.5
US yearly 2021-02-05 100 50 20 0.5
US NaN 2021-02-05 100 0 45 0
# the next day
US daily 2021-02-06 500 50 300 0.1
US monthly 2021-02-06 500 100 267 0.2
US yearly 2021-02-06 500 250 123 0.5
US NaN 2021-02-06 500 0 312 0
# the rest of the countries & the rest of the days
This is a follow-up to a question on this.
I tried:
df['date_purchase'] = df['date_purchase'].replace('trialed', np.nan)
exp = (df.groupby(['country', 'product', 'date_install'])
         .agg(installs=('date_install', 'size'),
              purchases=('date_purchase', 'count'),
              registrations=('date_purchase', 'count')))
exp['installs'] = exp.groupby(['country', 'date_install'])['installs'].transform('sum')
exp['registrations'] = exp.groupby(['country', 'date_install'])['registrations'].transform('sum')
exp['ratio'] = exp['purchases'].div(exp['installs'])
exp = exp.reset_index()
If I run:
print(df['date_install'].count())
print(df['date_authentication'].count())
print(df['date_purchase'].count())
I get:
2496159 # installs
112535 # registrations
24311 # purchases
When I run:
print(exp.installs.sum())
print(exp.registrations.sum())
print(exp.purchases.sum())
I get:
53993 # installs
29758 # registrations
24216 # purchases
But:
c = df.groupby(['date_install', 'country']).count()
print(c['date_authentication'].sum())
print(c['date_purchase'].sum())
returns:
111941 # registrations
24216 # purchases
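So at this point I am not sure how to get the desired structure without losing data. The install, registration, and purchase totals should stay the same after I group them, right?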
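To illustrate the mismatch, here is a minimal sketch on made-up data (the four-row `toy` frame and the variable names are hypothetical, not my real data). As far as I understand, `groupby` drops rows whose key is NaN by default, which might explain why installs without a `product` vanish from `exp`. Passing `dropna=False` (pandas 1.1 or later) keeps them as their own group.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 4 installs, only 1 of them has a product.
toy = pd.DataFrame({
    "country": ["US", "US", "US", "GB"],
    "product": ["monthly", np.nan, np.nan, np.nan],
    "date_install": ["2021-02-05"] * 4,
    "date_purchase": ["2021-02-06", np.nan, np.nan, np.nan],
})

keys = ["country", "product", "date_install"]

# Default behaviour: rows whose `product` key is NaN are left out of the groups.
default_installs = toy.groupby(keys).size().sum()

# dropna=False keeps NaN as its own group, so every row is counted.
kept_installs = toy.groupby(keys, dropna=False).size().sum()

print(default_installs, kept_installs)
```

If this is what happens on the real data, the installs missing from `exp` would be the rows where `product` is NaN.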
I noticed that when I use df['product'] = df['product'].fillna('None') before grouping:
print(exp.installs.sum())
print(exp.registrations.sum())
print(exp.purchases.sum())
returns:
6674304 # installs
308911 # registrations
24216 # purchases
This makes more sense, but now I have more installs and registrations than before, while the number of purchases is roughly the same.
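A possible explanation for the inflated totals, sketched on made-up data (the `toy` frame and the `exp_toy`, `true_total`, `inflated_total` names are hypothetical): after `transform('sum')`, every product row of a (country, date_install) pair carries the full total for that pair, so summing the column again counts that total once per product row.

```python
import pandas as pd

# Hypothetical toy data: 3 installs on one day in one country,
# spread over 3 product values (as they would look after fillna('None')).
toy = pd.DataFrame({
    "country": ["US", "US", "US"],
    "product": ["monthly", "yearly", "None"],
    "date_install": ["2021-02-05"] * 3,
})

exp_toy = (toy.groupby(["country", "product", "date_install"])
              .agg(installs=("date_install", "size")))

# Before the transform: one install per product row.
true_total = exp_toy["installs"].sum()

# Broadcast the per-(country, date_install) total onto every product row.
exp_toy["installs"] = (exp_toy.groupby(["country", "date_install"])["installs"]
                              .transform("sum"))

# After the transform: the same total is repeated on each product row.
inflated_total = exp_toy["installs"].sum()

print(true_total, inflated_total)
```

If that is right, `exp.installs.sum()` after the transform is not directly comparable to `df['date_install'].count()`, because each daily total is repeated for every product value.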