I have a df like this:
Country product date_install date_purchase id
BR yearly 2020-11-01-01:11:36 2020-11-01-01:11:26 10660236
CA monthly 2020-11-01-01:11:49 2020-11-01-01:11:32 10649441
US yearly 2020-11-01-01:11:54 2020-11-01-01:11:33 10660272
IT monthly 2020-11-01-11:11:01 2020-11-01-01:11:34 10657634
AE monthly 2020-11-01-01:11:38 2020-11-01-01:11:39 10661442
US NaN 2021-01-12-03:01:31 NaN 12815946
CA NaN 2020-12-04-02:12:48 NaN 11647714
US NaN 2020-12-28-11:12:54 NaN 12323174
ID NaN 2021-02-02-01:02:58 NaN 13714980
US NaN 2020-11-15-10:11:05 NaN 11056138
I want to get this:
country product installs purchases
BR yearly 1 1
BR NaN 100 0 # people who installed but not purchased
CA monthly 1 1
US yearly 10 10
US monthly 15 15
US NaN 500 0 # people who installed but not purchased
Or, even better:
country installs yearly monthly total
BR 1000 10 100 110
CA 2000 50 5 55
I tried:
df.groupby(['country','product']).count().sort_values('date_install',ascending=False)
But all the values are the same and match the number of purchases, as if everyone who installed had also purchased.
date_install date_purchase id
country product
US monthly 3373 3373 3373
AU monthly 1478 1478 1478
US yearly 954 954 954
If I use:
df = df.replace(np.nan, 'empty', regex=True)
df.groupby(['country','product']).count().sort_values('date_install',ascending=False)
I get:
date_install date_purchase id
country product
US empty 480153 480153 480153
AU empty 334236 334236 334236
BR empty 144920 144920 144920
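How can I achieve this result?
Answer 0 (score: 1)
Indeed, if you follow @Paul Brennan's suggestion, the solution becomes quite easy. For example, consider the following data: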
Country product date_install date_purchase id
0 BR yearly 2020-01-01-01:00:00 2020-01-01-01:00:00 10660236
3 BR monthly 2020-01-01-04:00:00 2020-01-01-04:00:00 10660239
6 BR NaN 2020-01-01-07:00:00 NaN 10660242
9 BR NaN 2020-01-01-10:00:00 NaN 10660245
1 CA yearly 2020-01-01-02:00:00 2020-01-01-02:00:00 10660237
4 CA yearly 2020-01-01-05:00:00 2020-01-01-05:00:00 10660240
7 CA NaN 2020-01-01-08:00:00 NaN 10660243
10 CA yearly 2020-01-01-11:00:00 2020-01-01-11:00:00 10660246
2 US monthly 2020-01-01-03:00:00 2020-01-01-03:00:00 10660238
5 US NaN 2020-01-01-06:00:00 NaN 10660241
8 US monthly 2020-01-01-09:00:00 2020-01-01-09:00:00 10660244
11 US monthly 2020-01-01-12:00:00 2020-01-01-12:00:00 10660247
Assuming the "not purchased" version is called demo or something similar:
df['product'] = df['product'].fillna('demo')
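The fill step matters because `groupby` leaves out rows whose key is NaN by default, which would explain why the counts in the question only covered purchasers. Here is a minimal sketch on made-up data (the frame `toy` and its values are illustrative, not taken from the question). The `dropna=False` option assumes pandas 1.1 or later.

```python
import numpy as np
import pandas as pd

# Illustrative data: two purchasers and two install-only users.
toy = pd.DataFrame({
    "Country": ["BR", "BR", "US", "US"],
    "product": ["yearly", np.nan, "monthly", np.nan],
    "id": [1, 2, 3, 4],
})

# Default behaviour: rows with a NaN key are left out of the result.
default_counts = toy.groupby(["Country", "product"]).size()

# Keep NaN as its own group (assumes pandas >= 1.1).
with_nan = toy.groupby(["Country", "product"], dropna=False).size()

# Filling NaN with a label keeps all rows without relying on dropna=False.
filled = toy.assign(product=toy["product"].fillna("demo"))
filled_counts = filled.groupby(["Country", "product"]).size()

print(default_counts.sum(), with_nan.sum(), filled_counts.sum())
```

Either approach keeps the install-only rows. Filling with a label has the advantage that the new group gets a readable column name after `unstack`.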
You can then do the following:
ans = (df.groupby(['Country', 'product'])
         .size()                    # number of rows per (Country, product) pair
         .unstack(fill_value=0)     # products become columns; missing pairs count as 0
         .rename_axis(columns='', index='')
         .assign(installed=lambda x: x[['demo', 'monthly', 'yearly']].sum(axis=1),
                 purchased=lambda x: x[['monthly', 'yearly']].sum(axis=1))
      )
The resulting dataframe is:
    demo  monthly  yearly  installed  purchased
BR     2        1       1          4          2
CA     1        0       3          4          3
US     1        3       0          4          3
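For the wide layout the question calls "even better" (installs, yearly, monthly and total per country), one possible sketch uses `pd.crosstab`. The frame `sample` is a shortened stand-in for the example data above with the dates omitted, and the column names `installs` and `total` are borrowed from the question's desired output.

```python
import numpy as np
import pandas as pd

# Stand-in for the example data above (dates omitted for brevity).
sample = pd.DataFrame({
    "Country": ["BR", "BR", "BR", "BR",
                "CA", "CA", "CA", "CA",
                "US", "US", "US", "US"],
    "product": ["yearly", "monthly", np.nan, np.nan,
                "yearly", "yearly", np.nan, "yearly",
                "monthly", np.nan, "monthly", "monthly"],
})

# Label install-only rows so they are counted rather than dropped.
sample["product"] = sample["product"].fillna("demo")

# Count rows per country and product.
wide = pd.crosstab(sample["Country"], sample["product"])

# Make sure every product column exists even if no row has that value.
wide = wide.reindex(columns=["demo", "monthly", "yearly"], fill_value=0)

summary = pd.DataFrame({
    "installs": wide.sum(axis=1),               # every row is an install
    "yearly": wide["yearly"],
    "monthly": wide["monthly"],
    "total": wide["yearly"] + wide["monthly"],  # purchases only
})
print(summary)
```

The `reindex` step is a defensive choice: without it, selecting `wide["yearly"]` would raise a `KeyError` on data where nobody bought a yearly plan.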
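To answer the question in your comment: you cannot keep the per-user dates, because groupby aggregates all the information and those individual details are lost.
What you can do is assign the columns you need from the resulting dataframe back to the first one (which produces some repeated values), for example: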
df = df.assign(purchased=df['Country'].map(ans['purchased']),
installed=df['Country'].map(ans['installed']))
This will make your first dataframe look like:
Country product date_install date_purchase id purchased installed
0 BR yearly 2020-01-01-01:00:00 2020-01-01-01:00:00 10660236 2 4
1 CA yearly 2020-01-01-02:00:00 2020-01-01-02:00:00 10660237 3 4
2 US monthly 2020-01-01-03:00:00 2020-01-01-03:00:00 10660238 3 4
.
.
.
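As an alternative to mapping from `ans`, the per-country totals can be attached to each row directly with `groupby(...).transform`. This is only a sketch on a small made-up frame (`rows` and its values are illustrative). It assumes `date_purchase` is NaN exactly when the user did not purchase, as in the question's data, and that `id` is never NaN.

```python
import numpy as np
import pandas as pd

# Illustrative data: one row per user, NaN date_purchase means no purchase.
rows = pd.DataFrame({
    "Country": ["BR", "BR", "CA", "US", "US"],
    "date_install": ["2020-01-01"] * 5,
    "date_purchase": ["2020-01-01", np.nan, "2020-01-02", np.nan, np.nan],
    "id": [1, 2, 3, 4, 5],
})

by_country = rows.groupby("Country")

rows = rows.assign(
    # 'count' ignores NaN, so this counts purchases per country.
    purchased=by_country["date_purchase"].transform("count"),
    # id is never NaN here, so counting it gives installs per country.
    installed=by_country["id"].transform("count"),
)
print(rows)
```

Because `transform` returns a result aligned with the original index, the per-user dates stay in place and no intermediate summary frame is needed.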
If this is not what you were looking for, let us know and we will try to work it out.