Count values in a pandas groupby that are not NaN

Time: 2021-02-10 08:29:42

Tags: python pandas

This is a follow-up to this question.

I have a df like this:

country product  date_install  date_purchase  date_authentication  user_id
BR      yearly   2020-11-01    2020-11-01     2020-11-01           10660236
CA      monthly  2020-11-01    trialed        trialed              0649441
US      yearly   2020-11-01    trialed        2020-11-01           10660272
IT      monthly  2020-11-01    2020-11-01     2020-11-01           10657634
AE      monthly  2020-11-01    2020-11-01     2020-11-01           10661442
IT      monthly  2020-11-01    trialed        trialed              10657634
AE      monthly  2020-11-01    trialed        2020-11-05           10661442

I want to get:

country product  date_install  installs  purchases  registrations  ratio
US      daily    2021-02-05    100       20         30             0.2
US      monthly  2021-02-05    100       50         40             0.5
US      yearly   2021-02-05    100       50         20             0.5
US      trialed  2021-02-05    100       0          45             0
# the next day
US      daily    2021-02-06    500       50         300            0.1
US      monthly  2021-02-06    500       100        267            0.2
US      yearly   2021-02-06    500       250        123            0.5
US      trialed  2021-02-06    500       0          312            0
# the rest of the countries & the rest of the days

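I am trying to get the purchases/installs ratio and the actual numbers of installs, registrations and purchases for each country, product and date. date_install is the install date, date_authentication is the registration date, and date_purchase gives the purchase date and shows that a purchase happened. A trialed value in date_purchase means that the user with that user_id made no purchase, and trialed in date_authentication means that the user has not registered.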
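A minimal sketch of how these columns could turn into counts, assuming a hypothetical toy frame (the rows in toy are invented for illustration and are not the real data): once trialed is replaced by NaN, size counts every row in a group, while count skips the missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows, invented for illustration only
toy = pd.DataFrame({
    "country": ["US", "US", "US", "IT"],
    "product": ["yearly", "yearly", "yearly", "monthly"],
    "date_install": ["2020-11-01"] * 4,
    "date_purchase": ["2020-11-01", "trialed", "trialed", "2020-11-01"],
    "date_authentication": ["2020-11-01", "2020-11-01", "trialed", "2020-11-01"],
})

# Treat the 'trialed' placeholder as missing so that count() ignores it
cols = ["date_purchase", "date_authentication"]
toy[cols] = toy[cols].replace("trialed", np.nan)

summary = (
    toy.groupby(["country", "product", "date_install"])
       .agg(
           installs=("date_purchase", "size"),              # every row, NaN included
           purchases=("date_purchase", "count"),            # non-NaN rows only
           registrations=("date_authentication", "count"),  # non-NaN rows only
       )
       .reset_index()
)
print(summary)
```

In this toy frame the US/yearly group has 3 installs, 1 purchase and 2 registrations, because two of its date_purchase values and one of its date_authentication values were trialed.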

The count of installs needs to be the sum of all installs for the day, and the same goes for registrations.

Using and trying to update jezrael's answer:

import numpy as np

# Treat the 'trialed' placeholder as missing so that count() skips it
df['date_purchase'] = df['date_purchase'].replace('trialed', np.nan)
df['date_authentication'] = df['date_authentication'].replace('trialed', np.nan)

print(df['date_install'].count())
print(df['date_authentication'].count())
print(df['date_purchase'].count())
# 2496159
# 112535
# 24311

# size counts all rows per group, count only the non-NaN ones
exp = (df.groupby(['country','product','date_install'])
         .agg(installs = ('date_purchase','size'),
              purchases = ('date_purchase','count'),
              registrations = ('date_authentication','count')))

# Broadcast the per-country, per-day totals to every product row
exp['installs'] = exp.groupby(['country','date_install'])['installs'].transform('sum')
exp['registrations'] = exp.groupby(['country','date_install'])['registrations'].transform('sum')
exp['ratio'] = exp['purchases'].div(exp['installs'])
exp = exp.reset_index()
exp

But exp has the same count for every metric, while clearly installs > registrations > purchases:

print(exp['installs'].count())
print(exp['purchases'].count())
print(exp['registrations'].count())
# 5035
# 5035
# 5035

Where is my mistake? I am trying to get, per country, product, date_install, the counts of install, registration and purchase events, taken from date_install, date_authentication and date_purchase (the date_x columns) where the value is a date and not nan / trialed.

UPDATE

print(exp.isna().sum()) returns:

country                    0
product                    0
date_install               0
date_authentication        0
installs                   0
purchases                  0
registrations              0
ratio_from_install         0
ratio_from_registration    0

and sum returns what I expect:

print(exp['installs'].sum())
print(exp['registrations'].sum())
print(exp['purchases'].sum())
# 143090
# 95860
# 13136

I also get this error:

TypeError: unsupported operand type(s) for +: 'int' and 'str'

How do I get the real numbers?
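One possible reading of the identical 5035 values, sketched under assumptions: on an already aggregated frame, Series.count() returns the number of non-null rows (one per group), not the total of the values, so every metric column gives the same result. The frame exp_toy below is hypothetical and only stands in for exp.

```python
import pandas as pd

# Hypothetical aggregated frame standing in for exp (values are invented)
exp_toy = pd.DataFrame({
    "country": ["US", "US", "IT"],
    "product": ["monthly", "yearly", "monthly"],
    "date_install": ["2021-02-05"] * 3,
    "installs": [100, 100, 40],      # per-day total repeated on each product row
    "registrations": [60, 60, 25],
    "purchases": [50, 20, 10],
})

metrics = ["installs", "registrations", "purchases"]

# count() is the number of non-null rows, identical for every column
counts = {c: int(exp_toy[c].count()) for c in metrics}
# sum() adds up the values stored in the rows
totals = {c: int(exp_toy[c].sum()) for c in metrics}
print(counts)
print(totals)

# A column broadcast across products is summed once per product row above;
# dropping repeats per country and day gives the de-duplicated total
per_day_installs = int(
    exp_toy.drop_duplicates(["country", "date_install"])["installs"].sum()
)
print(per_day_installs)
```

Here each count is 3 (the number of rows), while the sums differ per metric. Because installs and registrations are broadcast with transform('sum') in the code above, summing them over all rows of exp counts each country and day total once per product, which the last step avoids.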

0 answers:

No answers