Question

I'm doing a cohort analysis (I'm following this link, but adapting it to my needs). I have a dataframe like this:

> df.head()
id          user_created_at   status_id     status_period     cohort_group
30322300    2017-12-17        30322311.0    2017-12           2017-12
30322268    2017-12-17        NaN           NaN               2017-12
12463236    2017-05-24        NaN           NaN               2017-05
16454748    2017-08-10        16455080.0    2017-08           2017-08
4100773     2017-02-24        4153065.0     2017-02           2017-02

I group it by ['cohort_group', 'status_period'] and use the agg method, expecting it to count the unique id and status_id values across all rows.

df.groupby(['cohort_group', 'status_period']).agg(
                      {'id': pd.Series.nunique,
                       'status_id': pd.Series.nunique,
                      })

                                 id status_id 

cohort_group    status_period       
2015-02             2015-02       3     3.0
                    2015-03       2     2.0
                    2015-05       1     1.0
                    2015-06       1     1.0
                    2016-01       1     1.0

2015-03             2015-03       126   126.0
                    2015-05       13    13.0
                    2015-07       1     1.0
                    2016-06       1     1.0

2015-04             2015-04       120   120.0
                    2015-05       479   479.0
                    2015-06       1     1.0
...

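Since some rows have NaN in status_id, I expected the id count to be higher than the status_id count. I believe that after the groupby, rows where status_period is NaN are not considered at all, which is why both columns show the same values.

How can I make the agg method consider all rows, including those where status_period is NaN?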
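
For reference, the behaviour described above can be reproduced with just the five rows shown in df.head(). This is only a minimal sketch: the frame is rebuilt by hand from that printout (user_created_at is left out because it plays no part in the grouping), and it relies on the default groupby behaviour of skipping rows whose key is NaN.

```python
import numpy as np
import pandas as pd

# Five sample rows copied from the df.head() printout in the question
df = pd.DataFrame({
    "id": [30322300, 30322268, 12463236, 16454748, 4100773],
    "status_id": [30322311.0, np.nan, np.nan, 16455080.0, 4153065.0],
    "status_period": ["2017-12", np.nan, np.nan, "2017-08", "2017-02"],
    "cohort_group": ["2017-12", "2017-12", "2017-05", "2017-08", "2017-02"],
})

grouped = df.groupby(["cohort_group", "status_period"]).agg(
    {"id": pd.Series.nunique, "status_id": pd.Series.nunique}
)

# The two rows with a NaN status_period never reach agg,
# so fewer ids are counted than there are rows in df.
print(len(df), "rows in df,", int(grouped["id"].sum()), "ids counted")
print(grouped)
```

The 2017-05 cohort disappears entirely here, because its only row has no status_period.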

Answer 1

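When grouping on non-categorical data, pandas currently drops rows whose group key is NaN. A workaround is to use something like fillna before grouping (see the pandas docs):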

df.fillna(-1).groupby(['cohort_group','status_period']).agg(
                      {'id': pd.Series.nunique,
                       'status_id': pd.Series.nunique,
                      })

Output:

                            status_id  id
cohort_group status_period               
2017-02      2017-02              1.0   1
2017-05      -1                   1.0   1
2017-08      2017-08              1.0   1
2017-12      -1                   1.0   1
             2017-12              1.0   1
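
One thing to watch in the output above: status_id is 1.0 even in the -1 groups, because fillna(-1) on the whole frame also replaces the missing status_id values, and nunique then counts -1 as a value. A possible variation, sketched on the five sample rows from the question, is to fill only the grouping column. The placeholder label "no_status" is made up for illustration; any value that cannot collide with a real period would do.

```python
import numpy as np
import pandas as pd

# Five sample rows copied from the df.head() printout in the question
df = pd.DataFrame({
    "id": [30322300, 30322268, 12463236, 16454748, 4100773],
    "status_id": [30322311.0, np.nan, np.nan, 16455080.0, 4153065.0],
    "status_period": ["2017-12", np.nan, np.nan, "2017-08", "2017-02"],
    "cohort_group": ["2017-12", "2017-12", "2017-05", "2017-08", "2017-02"],
})

# Fill only the grouping key; "no_status" is a hypothetical placeholder label.
filled = df.assign(status_period=df["status_period"].fillna("no_status"))

result = filled.groupby(["cohort_group", "status_period"]).agg(
    {"id": pd.Series.nunique, "status_id": pd.Series.nunique}
)
print(result)
```

With this, status_id stays NaN in the placeholder groups, so its unique count there is 0 while id is still counted.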

Answer 2

You can also change the dtype of the column before grouping:

df.astype({'status_period':'str'}).groupby(['cohort_group', 'status_period']).agg(
                      {'id': pd.Series.nunique,
                       'status_id': pd.Series.nunique,
                      })

Output:

                            id  status_id
cohort_group status_period
2017-02      2017-02         1        1.0
2017-05      nan             1        0.0
2017-08      2017-08         1        1.0
2017-12      2017-12         1        1.0
             nan             1        0.0
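
If your pandas version is 1.1 or later, groupby also accepts dropna=False, which keeps NaN as a group key of its own without changing any data. This is a sketch on the five sample rows from the question, not something from the original answers, and it will not work on older pandas releases.

```python
import numpy as np
import pandas as pd

# Five sample rows copied from the df.head() printout in the question
df = pd.DataFrame({
    "id": [30322300, 30322268, 12463236, 16454748, 4100773],
    "status_id": [30322311.0, np.nan, np.nan, 16455080.0, 4153065.0],
    "status_period": ["2017-12", np.nan, np.nan, "2017-08", "2017-02"],
    "cohort_group": ["2017-12", "2017-12", "2017-05", "2017-08", "2017-02"],
})

# dropna=False (pandas >= 1.1) keeps rows whose group key is NaN
result = df.groupby(["cohort_group", "status_period"], dropna=False).agg(
    {"id": pd.Series.nunique, "status_id": pd.Series.nunique}
)

# Flatten the index to inspect the groups that have no status_period
flat = result.reset_index()
missing = flat[flat["status_period"].isna()]
print(result)
print(missing)
```

The NaN groups then show an id count with a status_id count of 0, which is the difference the question was looking for.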
