Counting NaN values in a pandas groupby

Time: 2021-02-05 13:56:36

Tags: python pandas

I have a df like this:

Country     product                 date_install                date_purchase           id
BR          yearly                  2020-11-01-01:11:36         2020-11-01-01:11:26     10660236
CA          monthly                 2020-11-01-01:11:49         2020-11-01-01:11:32     10649441
US          yearly                  2020-11-01-01:11:54         2020-11-01-01:11:33     10660272
IT          monthly                 2020-11-01-11:11:01         2020-11-01-01:11:34     10657634
AE          monthly                 2020-11-01-01:11:38         2020-11-01-01:11:39     10661442
US          NaN                     2021-01-12-03:01:31         NaN                     12815946
CA          NaN                     2020-12-04-02:12:48         NaN                     11647714
US          NaN                     2020-12-28-11:12:54         NaN                     12323174
ID          NaN                     2021-02-02-01:02:58         NaN                     13714980
US          NaN                     2020-11-15-10:11:05         NaN                     11056138

I want to get this:

country     product     installs        purchases
BR          yearly      1               1
BR          NaN         100             0 # people who installed but not purchased
CA          monthly     1               1
US          yearly      10              10
US          monthly     15              15
US          NaN         500             0 # people who installed but not purchased

Or, even better:

country     installs    yearly  monthly  total
BR          1000        10      100      110
CA          2000        50      5        55

I tried:

df.groupby(['country','product']).count().sort_values('date_install',ascending=False)

But all the values are the same and match the number of purchases, which would mean that everyone who installed also purchased.

                    date_install    date_purchase   id
country product         
US      monthly     3373            3373            3373
AU      monthly     1478            1478            1478
US      yearly      954             954             954

If I use:

df = df.replace(np.nan, 'empty', regex=True)
df.groupby(['country','product']).count().sort_values('date_install',ascending=False)

I get:

                    date_install    date_purchase   id
country product         
US      empty       480153          480153          480153
AU      empty       334236          334236          334236
BR      empty       144920          144920          144920

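How can I achieve this result?

1 answer:

Answer 0: (score: 1)

Indeed, if you follow @Paul Brennan's suggestion, the solution becomes very easy. For example, consider the following data: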

   Country  product         date_install        date_purchase        id
0       BR   yearly  2020-01-01-01:00:00  2020-01-01-01:00:00  10660236
3       BR  monthly  2020-01-01-04:00:00  2020-01-01-04:00:00  10660239
6       BR      NaN  2020-01-01-07:00:00                  NaN  10660242
9       BR      NaN  2020-01-01-10:00:00                  NaN  10660245
1       CA   yearly  2020-01-01-02:00:00  2020-01-01-02:00:00  10660237
4       CA   yearly  2020-01-01-05:00:00  2020-01-01-05:00:00  10660240
7       CA      NaN  2020-01-01-08:00:00                  NaN  10660243
10      CA   yearly  2020-01-01-11:00:00  2020-01-01-11:00:00  10660246
2       US  monthly  2020-01-01-03:00:00  2020-01-01-03:00:00  10660238
5       US      NaN  2020-01-01-06:00:00                  NaN  10660241
8       US  monthly  2020-01-01-09:00:00  2020-01-01-09:00:00  10660244
11      US  monthly  2020-01-01-12:00:00  2020-01-01-12:00:00  10660247

Assuming the "not purchased" version is a demo or something similar:

df['product'] = df['product'].fillna('demo')

You can then do the following:

ans = (df.groupby(['Country', 'product'])
       .size()                   # number of rows per (Country, product) pair
       .unstack(fill_value=0)    # products become columns; missing pairs count as 0
       .rename_axis(columns='', index='')
       .assign(installed=lambda x: x[['demo', 'monthly', 'yearly']].sum(axis=1),
               purchased=lambda x: x[['monthly', 'yearly']].sum(axis=1))
       )

The resulting dataframe is as follows:

    demo  monthly  yearly  installed  purchased
                                               
BR     2        1       1          4          2
CA     1        0       3          4          3
US     1        3       0          4          3
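
If you would rather not invent a placeholder product, the long layout from the question can probably be produced directly. The following is only a sketch: it assumes pandas 1.1 or later (where `groupby` accepts `dropna=False`), and the sample rows and the names `long_counts`, `installs` and `purchases` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Small made-up sample in the same shape as the question's data.
df = pd.DataFrame({
    "Country": ["BR", "BR", "BR", "CA", "CA", "US", "US"],
    "product": ["yearly", np.nan, np.nan, "monthly", np.nan, "monthly", np.nan],
    "date_install": ["2020-01-01-01:00:00", "2020-01-01-02:00:00",
                     "2020-01-01-03:00:00", "2020-01-01-04:00:00",
                     "2020-01-01-05:00:00", "2020-01-01-06:00:00",
                     "2020-01-01-07:00:00"],
    "date_purchase": ["2020-01-01-01:00:00", np.nan, np.nan,
                      "2020-01-01-04:00:00", np.nan,
                      "2020-01-01-06:00:00", np.nan],
    "id": [1, 2, 3, 4, 5, 6, 7],
})

# dropna=False keeps the rows whose group key (product) is NaN;
# by default groupby silently drops them.
long_counts = (
    df.groupby(["Country", "product"], dropna=False)
      .agg(installs=("date_install", "count"),    # count() ignores NaN values
           purchases=("date_purchase", "count"))
      .reset_index()
)
print(long_counts)
```

Here the rows with a NaN product are the people who installed but did not purchase, so their `purchases` count is 0 while `installs` is not.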

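In answer to the question in your comment: you cannot keep each user's dates, because groupby aggregates all the rows of a group and those individual details are lost.

What you can do is assign the desired columns of the resulting dataframe back onto the first one (which gives some repeated values), for example: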

df = df.assign(purchased=df['Country'].map(ans['purchased']),
               installed=df['Country'].map(ans['installed']))

This will make your first dataframe look like:

   Country  product         date_install        date_purchase        id  purchased  installed
0       BR   yearly  2020-01-01-01:00:00  2020-01-01-01:00:00  10660236          2          4
1       CA   yearly  2020-01-01-02:00:00  2020-01-01-02:00:00  10660237          3          4
2       US  monthly  2020-01-01-03:00:00  2020-01-01-03:00:00  10660238          3          4
.
.
.
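
The same per-country totals can also be attached to every row without building a separate summary table first, using `groupby(...).transform`. This is a sketch of that alternative rather than the method used above; the sample rows are made up, and it assumes every row represents one install and that `date_purchase` is NaN exactly when there was no purchase.

```python
import numpy as np
import pandas as pd

# Made-up sample: one row per install, NaN date_purchase means no purchase.
df = pd.DataFrame({
    "Country": ["BR", "BR", "BR", "CA", "CA"],
    "product": ["yearly", np.nan, np.nan, "monthly", np.nan],
    "date_install": ["2020-01-01-01:00:00", "2020-01-01-02:00:00",
                     "2020-01-01-03:00:00", "2020-01-01-04:00:00",
                     "2020-01-01-05:00:00"],
    "date_purchase": ["2020-01-01-01:00:00", np.nan, np.nan,
                      "2020-01-01-04:00:00", np.nan],
})

by_country = df.groupby("Country")

# transform("count") returns one value per original row (aligned on the
# index), so the per-country totals repeat on every row of that country.
df = df.assign(
    installed=by_country["date_install"].transform("count"),
    purchased=by_country["date_purchase"].transform("count"),
)
print(df)
```

Because `transform` keeps the original index, the individual dates and ids stay in place next to the repeated totals.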

If this is not what you want, let us know and we will try to work it out.