Question

Follow-up to an earlier question.

I have a df like this:

country product     date_install    date_purchase     date_authentication    user_id
BR      yearly      2020-11-01      2020-11-01        2020-11-01             10660236
CA      monthly     2020-11-01      trialed           trialed                0649441
US      yearly      2020-11-01      trialed           2020-11-01             10660272
IT      monthly     2020-11-01      2020-11-01        2020-11-01             10657634
AE      monthly     2020-11-01      2020-11-01        2020-11-01             10661442
IT      monthly     2020-11-01      trialed           trialed                10657634
AE      monthly     2020-11-01      trialed           2020-11-05             10661442

I want to get:

country product     date_install        installs    purchases     registrations  ratio
US      daily       2021-02-05          100         20            30             0.2
US      monthly     2021-02-05          100         50            40             0.5
US      yearly      2021-02-05          100         50            20             0.5
US      trialed     2021-02-05          100         0             45             0
# the next day
US      daily       2021-02-06          500         50            300            0.1
US      monthly     2021-02-06          500         100           267            0.2
US      yearly      2021-02-06          500         250           123            0.5
US      trialed     2021-02-06          500         0             312            0
# the rest of the countries & the rest of the days

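I am trying to get the ratio of purchases / installs, along with the actual number of installs, registrations and purchases, per country, product and date. date_install is the install date, date_authentication is the registration date, and date_purchase gives the purchase date and indicates that a purchase has happened. A trialed value in date_purchase means that the user with that user_id has made no purchase, and trialed in date_authentication means the user has not registered.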

The installs count needs to be the total of all installs for the day, and the same applies to registrations.
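The column semantics described above can be sketched on the seven sample rows. This is only an illustration under assumptions: it parses the date columns with `pd.to_datetime(..., errors="coerce")` so that `trialed` becomes `NaT` (a swapped-in alternative to replacing it with `np.nan`), and it counts per country, product and install date without the per-day totals.

```python
import pandas as pd

# Sample rows copied from the table in the question (user_id omitted).
df = pd.DataFrame({
    "country": ["BR", "CA", "US", "IT", "AE", "IT", "AE"],
    "product": ["yearly", "monthly", "yearly", "monthly",
                "monthly", "monthly", "monthly"],
    "date_install": ["2020-11-01"] * 7,
    "date_purchase": ["2020-11-01", "trialed", "trialed", "2020-11-01",
                      "2020-11-01", "trialed", "trialed"],
    "date_authentication": ["2020-11-01", "trialed", "2020-11-01", "2020-11-01",
                            "2020-11-01", "trialed", "2020-11-05"],
})

# errors="coerce" turns anything that is not a date (here 'trialed') into NaT.
for col in ["date_purchase", "date_authentication"]:
    df[col] = pd.to_datetime(df[col], format="%Y-%m-%d", errors="coerce")

exp = (
    df.groupby(["country", "product", "date_install"])
      .agg(
          installs=("date_install", "size"),               # every row is an install
          purchases=("date_purchase", "count"),            # count() skips NaT
          registrations=("date_authentication", "count"),  # count() skips NaT
      )
      .reset_index()
)
exp["ratio"] = exp["purchases"] / exp["installs"]
print(exp)
```

On these rows the group for AE / monthly holds two installs, one purchase and two registrations, since only real dates are counted.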

After using and trying to update jezrael's answer:

    df['date_purchase'] = df['date_purchase'].replace('trialed', np.nan)
    df['date_authentication'] = df['date_authentication'].replace('trialed', np.nan)

    print(df['date_install'].count())
    print(df['date_authentication'].count())
    print(df['date_purchase'].count())
    # 2496159
    # 112535
    # 24311

    exp = (df.groupby(['country','product','date_install'])
             .agg(installs = ('date_purchase','size'),
                  purchases = ('date_purchase','count'),
                  registrations = ('date_authentication','count')))

    exp['installs'] = exp.groupby(['country','date_install'])['installs'].transform('sum')
    exp['registrations'] = exp.groupby(['country','date_install'])['registrations'].transform('sum')
    exp['ratio'] = exp['purchases'].div(exp['installs'])
    exp = exp.reset_index()
    exp

But exp has the same count for every metric, whereas it is clear that:

installs>registrations>purchases

    print(exp['installs'].count())
    print(exp['purchases'].count())
    print(exp['registrations'].count())
    # 5035
    # 5035
    # 5035

Where is my mistake? I am trying to get the install, registration and purchase event counts per country, product, date_install, counting each date_x column (date_install, date_authentication, date_purchase) only where the value is a date and not nan / trialed.
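One thing that may explain the identical numbers: `Series.count()` returns the number of non-null entries in a column, so on an aggregated frame it reports the number of group rows for every metric, whereas `sum()` adds up the per-group values. A minimal sketch on a made-up frame (the values below are hypothetical, not taken from the real data):

```python
import pandas as pd

# Hypothetical aggregated frame: one row per (country, product, date_install) group.
exp = pd.DataFrame({
    "installs": [100, 100, 500],
    "purchases": [20, 50, 0],
    "registrations": [30, 40, 312],
})

# count() is the number of non-null rows, so it is identical for every column.
counts = [int(exp[c].count()) for c in ["installs", "purchases", "registrations"]]
print(counts)

# sum() adds the per-group values, which is where the metrics differ.
totals = [int(exp[c].sum()) for c in ["installs", "purchases", "registrations"]]
print(totals)
```

Note that once a daily total has been broadcast with `transform('sum')`, it is repeated on every product row, so summing that column would count each day's total once per product.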

Update

print(exp.isna().sum()) returns what is expected:

country                    0
product                    0
date_install               0
date_authentication        0
installs                   0
purchases                  0
registrations              0
ratio_from_install         0
ratio_from_registration    0

But sum raises:

TypeError: unsupported operand type(s) for +: 'int' and 'str'

How do I get the real numbers for the following?

    print(exp['installs'].sum())
    print(exp['registrations'].sum())
    print(exp['purchases'].sum())
    # 143090
    # 95860
    # 13136
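One common way this `TypeError` arises is summing an object column that mixes integers with leftover strings. Whether that is the cause here is an assumption; the column below is hypothetical. The sketch reproduces the error and then coerces non-numeric entries to `NaN` with `pd.to_numeric` before summing.

```python
import pandas as pd

# Hypothetical column mixing integers with a leftover string marker.
s = pd.Series([3, 5, "trialed"], dtype=object)

raised = False
try:
    s.sum()          # adds an int and a str, which Python does not support
except TypeError as err:
    raised = True
    print(err)

# Coerce non-numeric entries to NaN; sum() skips NaN by default.
clean = pd.to_numeric(s, errors="coerce")
total = clean.sum()
print(total)
```

Checking `exp.dtypes` before summing would show which columns, if any, are stored as `object` rather than as a numeric dtype.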

Count values that are not NaN in a pandas groupby

0 answers: