Question

I have a df like this:

country product     date_install        date_purchase        id                 date_authentication
DK      NaN         2020-12-28          NaN                  12343323           NaN
GB      NaN         2021-01-10          NaN                  12752971           NaN
UA      monthly     2020-11-05          2021-01-15           10766369           2021-01-15
PL      NaN         2021-01-24          NaN                  13314244           NaN
MX      NaN         2020-12-11          NaN                  11856945           NaN
GB      NaN         2020-12-02          NaN                  11607569           NaN
IT      yearly      2021-02-07          2021-01-15           13919183           NaN
KZ      NaN         2020-12-04          NaN                  11655951           2021-01-12
UA      NaN         2020-11-27          NaN                  11436990           NaN
US      NaN         2021-01-08          NaN                  12682751           2021-01-12

I am trying to get a df like this:

country product     date_install        installs    purchases     registrations  ratio
US      daily       2021-02-05          100         20            30             0.2
US      monthly     2021-02-05          100         50            40             0.5
US      yearly      2021-02-05          100         50            20             0.5
US      NaN         2021-02-05          100         0             45             0
# the next day
US      daily       2021-02-06          500         50            300            0.1
US      monthly     2021-02-06          500         100           267            0.2
US      yearly      2021-02-06          500         250           123            0.5
US      NaN         2021-02-06          500         0             312            0
# the rest of the countries & the rest of the days

This is a follow-up to an earlier question on this.

I tried:

import numpy as np
import pandas as pd

# Treat the 'trialed' marker as "no purchase"
df['date_purchase'] = df['date_purchase'].replace('trialed', np.nan)

exp = (df.groupby(['country', 'product', 'date_install'])
         .agg(installs=('date_install', 'size'),
              purchases=('date_purchase', 'count'),
              registrations=('date_purchase', 'count')))

exp['installs'] = exp.groupby(['country','date_install'])['installs'].transform('sum')
exp['registrations'] = exp.groupby(['country','date_install'])['registrations'].transform('sum')
exp['ratio'] = exp['purchases'].div(exp['installs'])

exp = exp.reset_index()

If I run:

print(df['date_install'].count())
print(df['date_authentication'].count())
print(df['date_purchase'].count())

I get:

2496159 # installs
112535 # registrations
24311 # purchases

When I run:

print(exp.installs.sum())
print(exp.registrations.sum())
print(exp.purchases.sum())

I get:

53993 # installs
29758 # registrations
24216 # purchases

But:

c = df.groupby(['date_install','country']).count()

print(c['date_authentication'].sum())
print(c['date_purchase'].sum())

returns:

111941 # registrations
24216 # purchases

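So at this point I am not sure how to get the desired structure without losing data, since the install/registration/purchase totals should stay the same after I group them, right?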
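One possible explanation for the lost rows, sketched on made-up data: by default `groupby` drops every row whose key is NaN, and most rows here have `product` = NaN. The toy frame below is hypothetical and only mirrors the column names above. It assumes pandas 1.1 or later (where `dropna=False` is available) and assumes registrations are meant to be counted from `date_authentication`, whereas the attempt above counts `date_purchase` for both columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data mirroring the columns in the question.
toy = pd.DataFrame({
    "country": ["US", "US", "US", "GB"],
    "product": ["monthly", np.nan, np.nan, np.nan],
    "date_install": ["2021-02-05"] * 4,
    "date_purchase": ["2021-02-06", np.nan, np.nan, np.nan],
    "date_authentication": ["2021-02-06", "2021-02-07", np.nan, np.nan],
})

keys = ["country", "product", "date_install"]
aggs = dict(
    installs=("date_install", "size"),
    purchases=("date_purchase", "count"),
    # Assumption: registrations come from date_authentication.
    registrations=("date_authentication", "count"),
)

# Default behaviour: rows with a NaN key (product) are silently dropped.
dropped = toy.groupby(keys).agg(**aggs)

# dropna=False (pandas >= 1.1) keeps NaN as its own group.
kept = toy.groupby(keys, dropna=False).agg(**aggs)

print(int(dropped["installs"].sum()))  # 1: only the row with a product survives
print(int(kept["installs"].sum()))     # 4: every install is counted
```

If this is what is happening, it would also explain why purchases barely change: rows with a purchase almost always have a product, so they survive the default grouping.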

I noticed that when I use df['product'] = df['product'].fillna('None') before grouping:

print(exp.installs.sum())
print(exp.registrations.sum())
print(exp.purchases.sum())

returns:

6674304 # installs
308911 # registrations
24216 # purchases

This makes more sense, but now I have more installs and registrations than before, while the number of purchases is roughly the same.
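A likely reason for the inflated totals, again sketched on hypothetical numbers: `transform('sum')` writes the per-(country, date_install) total onto every product row, so summing the column afterwards counts each total once per product. Purchases are not transformed, which would be why they stay the same.

```python
import pandas as pd

# Hypothetical per-product counts for one country and one install date.
exp_toy = pd.DataFrame({
    "country": ["US"] * 3,
    "product": ["daily", "monthly", "None"],
    "date_install": ["2021-02-05"] * 3,
    "installs": [20, 50, 30],
    "purchases": [20, 50, 0],
})

true_installs = int(exp_toy["installs"].sum())  # 100

# Same step as in the attempt: broadcast the group total to every product row.
exp_toy["installs"] = (
    exp_toy.groupby(["country", "date_install"])["installs"].transform("sum")
)

# Each of the 3 product rows now holds 100, so the column sum is 300.
inflated = int(exp_toy["installs"].sum())

# Counting each (country, date_install) total once recovers the real figure.
per_day = int(
    exp_toy.drop_duplicates(["country", "date_install"])["installs"].sum()
)

print(true_installs, inflated, per_day)  # 100 300 100
```

So the table itself may already have the desired shape; it is only the check with `exp.installs.sum()` that double counts. Deduplicating on the grouping keys before summing, as in `per_day`, is one way to compare against the totals from the original df.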

Calculating values for each column in a groupby
