I have a df that looks like this:
country product date_install date_purchase date_authentication user_id
BR yearly 2020-11-01 2020-11-01 2020-11-01 10660236
CA monthly 2020-11-01 trialed trialed 0649441
US yearly 2020-11-01 trialed 2020-11-01 10660272
IT monthly 2020-11-01 2020-11-01 2020-11-01 10657634
AE monthly 2020-11-01 2020-11-01 2020-11-01 10661442
IT monthly 2020-11-01 trialed trialed 10657634
AE monthly 2020-11-01 trialed 2020-11-05 10661442
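For reproducibility, the sample rows above can be loaded roughly like this. It is only a sketch: reading every column as a string (so that `trialed` and the leading zero in `0649441` survive) is an assumption, not necessarily how the real data is typed.

```python
import io

import pandas as pd

# Sample rows copied from the table above, space separated.
raw = """country product date_install date_purchase date_authentication user_id
BR yearly 2020-11-01 2020-11-01 2020-11-01 10660236
CA monthly 2020-11-01 trialed trialed 0649441
US yearly 2020-11-01 trialed 2020-11-01 10660272
IT monthly 2020-11-01 2020-11-01 2020-11-01 10657634
AE monthly 2020-11-01 2020-11-01 2020-11-01 10661442
IT monthly 2020-11-01 trialed trialed 10657634
AE monthly 2020-11-01 trialed 2020-11-05 10661442
"""

# dtype=str is an assumption: it keeps 'trialed' and the dates as plain text
# and stops user_id from being parsed as an integer.
df = pd.read_csv(io.StringIO(raw), sep=" ", dtype=str)

print(df.shape)
print(df.head())
```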
I want to get:
country product date_install installs purchases registrations ratio
US daily 2021-02-05 100 20 30 0.2
US monthly 2021-02-05 100 50 40 0.5
US yearly 2021-02-05 100 50 20 0.5
US trialed 2021-02-05 100 0 45 0
# the next day
US daily 2021-02-06 500 50 300 0.1
US monthly 2021-02-06 500 100 267 0.2
US yearly 2021-02-06 500 250 123 0.5
US trialed 2021-02-06 500 0 312 0
# the rest of the countries & the rest of the days
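I am trying to get the ratio of purchases to installs, along with the actual number of installs, registrations, and purchases, for each country, product, and date. `date_install` is the install date, `date_authentication` is the date of registration, and `date_purchase` gives the purchase date and shows that a purchase happened. A `trialed` value in `date_purchase` means that the user with that `user_id` made no purchase, and `trialed` in `date_authentication` means that the user has not registered.
The count for `installs` needs to be the sum of total installs for that day, and the same goes for `registrations`.
Using and trying to update jezrael's answer:
import numpy as np
# treat 'trialed' as missing so that 'count' skips it
df['date_purchase'] = df['date_purchase'].replace('trialed', np.nan)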
df['date_authentication'] = df['date_authentication'].replace('trialed', np.nan)
print(df['date_install'].count())
print(df['date_authentication'].count())
print(df['date_purchase'].count())
# 2496159
# 112535
# 24311
exp = (df.groupby(['country', 'product', 'date_install'])
         .agg(installs=('date_purchase', 'size'),
              purchases=('date_purchase', 'count'),
              registrations=('date_authentication', 'count')))
exp['installs'] = exp.groupby(['country','date_install'])['installs'].transform('sum')
exp['registrations'] = exp.groupby(['country','date_install'])['registrations'].transform('sum')
exp['ratio'] = exp['purchases'].div(exp['installs'])
exp = exp.reset_index()
exp
But `exp` has the same count for each of the metrics, while clearly `installs>registrations>purchases`:
print(exp['installs'].count())
print(exp['purchases'].count())
print(exp['registrations'].count())
# 5035
# 5035
# 5035
Where is my mistake? I am trying to get the `date_x` count per `country, product, date_install` of `install`, `registration`, and `purchase` events from `date_install`, `date_authentication`, and `date_purchase`, where the value is a date and not `nan / trialed`.
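One thing that may be worth checking, offered as a sketch and not as a diagnosis: on an aggregated frame, `Series.count()` returns the number of non-null rows (one per group), while `Series.sum()` adds up the per-group event counts. The toy data below is hypothetical, and the per-day `transform('sum')` steps are left out for brevity.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the real dataset.
df = pd.DataFrame({
    "country": ["US", "US", "US", "US", "IT"],
    "product": ["yearly", "yearly", "monthly", "monthly", "monthly"],
    "date_install": ["2020-11-01"] * 5,
    "date_purchase": ["2020-11-01", "trialed", "trialed", "trialed", "2020-11-01"],
    "date_authentication": ["2020-11-01", "2020-11-01", "trialed", "2020-11-03", "2020-11-01"],
})

# Treat 'trialed' as missing so that the 'count' aggregation skips it.
cols = ["date_purchase", "date_authentication"]
df[cols] = df[cols].replace("trialed", np.nan)

exp = (
    df.groupby(["country", "product", "date_install"])
      .agg(
          installs=("date_purchase", "size"),              # all rows in the group
          purchases=("date_purchase", "count"),            # non-missing purchase dates
          registrations=("date_authentication", "count"),  # non-missing registration dates
      )
      .reset_index()
)

metrics = ["installs", "purchases", "registrations"]
rows_per_metric = exp[metrics].count()   # number of groups, identical for every column
events_per_metric = exp[metrics].sum()   # total events across all groups

print(rows_per_metric)
print(events_per_metric)
```

In this toy case there are three groups, so every `count()` is 3, while the sums differ per metric.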
UPDATE
`print(exp.isna().sum())` returns what I expect:
country 0
product 0
date_install 0
date_authentication 0
installs 0
purchases 0
registrations 0
ratio_from_install 0
ratio_from_registration 0
`sum`:
> TypeError: unsupported operand type(s) for +: 'int' and 'str'
How do I get the real numbers in the following?
print(exp['installs'].sum())
print(exp['registrations'].sum())
print(exp['purchases'].sum())
# 143090
# 95860
# 13136
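For context on the error message itself, here is a small sketch of one way that exact `TypeError` can arise. It assumes a column holding text instead of numbers, which is a guess and not something confirmed by the data shown here; the `as_text` series is hypothetical.

```python
import pandas as pd

# Hypothetical column: counts that ended up stored as strings.
as_text = pd.Series(["100", "50", "20"])

message = None
try:
    # Built-in sum starts from the int 0 and then tries 0 + "100".
    sum(as_text)
except TypeError as err:
    message = str(err)

print(message)

# Converting to a numeric dtype first makes the total well defined.
total = pd.to_numeric(as_text).sum()
print(total)
```

If something like this is the cause, checking `exp.dtypes` before summing would show which column is not numeric.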