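Group by in pandas and fill NaN with the mean of the values before and after

Time: 2019-12-11 06:25:20

Tags: python python-3.x pandas

I am trying to fill the NaN cells with the mean of the values before and after them.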

   type     date        v1       v2
0     a  2018-09  21511.11  17696.8
1     a  2018-10       NaN      NaN
2     a  2018-11       NaN      NaN
3     a  2018-12  30319.98  24553.6
4     a  2019-01       NaN      NaN
5     a  2019-02       NaN      NaN
6     a  2019-03   7409.61   6110.0
7     a  2019-04       NaN      NaN
8     a  2019-05       NaN      NaN
9     a  2019-06  15212.51  12590.5
10    a  2019-07       NaN      NaN
11    a  2019-08       NaN      NaN
12    a  2019-09  23129.96  19160.9
13    a  2019-10       NaN      NaN
14    a  2019-11       NaN      NaN
15    b  2018-09  21511.11  17696.8
16    b  2018-10       NaN      NaN
17    b  2018-11       NaN      NaN
18    b  2018-12  30319.98  24553.6
19    b  2019-01       NaN      NaN
20    b  2019-02       NaN      NaN
21    b  2019-03   7409.61   6110.0
22    b  2019-04       NaN      NaN
23    b  2019-05       NaN      NaN
24    b  2019-06  15212.51  12590.5
25    b  2019-07       NaN      NaN
26    b  2019-08       NaN      NaN
27    b  2019-09  23129.96  19160.9
28    b  2019-10       NaN      NaN
29    b  2019-11       NaN      NaN

I tried the following code, taken from here:

df[['v1', 'v2']] = (df[['v1', 'v2']].ffill()+df[['v1', 'v2']].bfill())/2
df[['v1', 'v2']] = df[['v1', 'v2']].bfill().ffill()

I get:

   type     date         v1        v2
0     a  2018-09  21511.110  17696.80
1     a  2018-10  25915.545  21125.20
2     a  2018-11  25915.545  21125.20
3     a  2018-12  30319.980  24553.60
4     a  2019-01  18864.795  15331.80
5     a  2019-02  18864.795  15331.80
6     a  2019-03   7409.610   6110.00
7     a  2019-04  11311.060   9350.25
8     a  2019-05  11311.060   9350.25
9     a  2019-06  15212.510  12590.50
10    a  2019-07  19171.235  15875.70
11    a  2019-08  19171.235  15875.70
12    a  2019-09  23129.960  19160.90
13    a  2019-10  22320.535  18428.85
14    a  2019-11  22320.535  18428.85
15    b  2018-09  21511.110  17696.80
16    b  2018-10  25915.545  21125.20
17    b  2018-11  25915.545  21125.20
18    b  2018-12  30319.980  24553.60
19    b  2019-01  18864.795  15331.80
20    b  2019-02  18864.795  15331.80
21    b  2019-03   7409.610   6110.00
22    b  2019-04  11311.060   9350.25
23    b  2019-05  11311.060   9350.25
24    b  2019-06  15212.510  12590.50
25    b  2019-07  19171.235  15875.70
26    b  2019-08  19171.235  15875.70
27    b  2019-09  23129.960  19160.90
28    b  2019-10  23129.960  19160.90
29    b  2019-11  23129.960  19160.90

But I don't know how to group by type and apply the code above. Can anyone help? Thanks.
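To make the problem concrete: in the output above, rows 13 and 14 (type a) get 22320.535, which is (23129.96 + 21511.11) / 2, i.e. the last value of a averaged with the first value of b. A minimal sketch on a made-up miniature frame (the values are illustrative only, not the question's data) shows the same leak:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame: each group ends with a NaN.
df = pd.DataFrame({
    "type": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "v1": [10.0, np.nan, 20.0, np.nan, 100.0, np.nan, 200.0, np.nan],
})

# Ungrouped version from the question: ffill/bfill ignore the group boundary.
ungrouped = (df["v1"].ffill() + df["v1"].bfill()) / 2

# Row 3 belongs to "a" but is averaged with the first value of "b".
print(ungrouped.tolist())
```

Row 3 ends up as the mean of 20 (group a) and 100 (group b), which is exactly what grouping by type is meant to prevent.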

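2 answers:

Answer 0: (score: 3)

Use groupby with the list of columns to process. For the leading and trailing missing values of each group, use apply so that values are not filled from one group into another when a group has NaNs only at its edges: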

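g = df.groupby('type')[['v1', 'v2']]
# mean of the previous and next valid value within each group
df[['v1', 'v2']] = (g.ffill() + g.bfill()) / 2

# regroup the updated frame, then fill each group's leading/trailing NaN
g = df.groupby('type', group_keys=False)[['v1', 'v2']]
df[['v1', 'v2']] = g.apply(lambda x: x.bfill().ffill())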
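For anyone who wants to try this end to end, here is a small self-contained sketch of the grouped approach. The frame is a made-up miniature (not the question's data), and the groupby is rebuilt before the edge fill on the assumption that a groupby object created earlier may not see the newly assigned values:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with NaN inside and at the edges of each group.
df = pd.DataFrame({
    "type": ["a"] * 4 + ["b"] * 4,
    "v1": [10.0, np.nan, 20.0, np.nan, 100.0, np.nan, 200.0, np.nan],
    "v2": [1.0, np.nan, 3.0, np.nan, np.nan, 5.0, np.nan, 7.0],
})
cols = ["v1", "v2"]

# Step 1: per group, average the previous and next valid value.
g = df.groupby("type")[cols]
df[cols] = (g.ffill() + g.bfill()) / 2

# Step 2: per group, fill what is still NaN at the group's start or end.
# group_keys=False keeps the original row index so the assignment aligns.
g = df.groupby("type", group_keys=False)[cols]
df[cols] = g.apply(lambda x: x.bfill().ffill())

print(df)
```

The trailing NaN of group a is filled with 20 (its own last value), not with anything from group b.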

Solution for numeric columns only:

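import numpy as np

cols = df.select_dtypes(np.number).columns

g = df.groupby('type')[cols]
df[cols] = (g.ffill() + g.bfill()) / 2
# regroup so the edge fill sees the averaged values
g = df.groupby('type', group_keys=False)[cols]
df[cols] = g.apply(lambda x: x.bfill().ffill())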
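If this is needed in more than one place, the numeric-column variant could be wrapped in a function. The sketch below is only an illustration: the name fill_neighbor_mean and the sample frame are hypothetical, and the key column is excluded from the numeric columns in case it happens to be numeric itself.

```python
import numpy as np
import pandas as pd

def fill_neighbor_mean(df, key):
    """Hypothetical helper: per-group neighbor mean for all numeric columns."""
    out = df.copy()  # leave the caller's frame untouched
    cols = [c for c in out.select_dtypes(np.number).columns if c != key]
    g = out.groupby(key)[cols]
    out[cols] = (g.ffill() + g.bfill()) / 2
    # Regroup, then fill leading/trailing NaN inside each group.
    g = out.groupby(key, group_keys=False)[cols]
    out[cols] = g.apply(lambda x: x.bfill().ffill())
    return out

# Made-up sample: "type" and "date" are strings, so only v1 and v2 are processed.
df = pd.DataFrame({
    "type": ["a", "a", "a", "b", "b"],
    "date": ["2018-09", "2018-10", "2018-11", "2018-09", "2018-10"],
    "v1": [1.0, np.nan, 3.0, 10.0, np.nan],
    "v2": [2.0, np.nan, 6.0, np.nan, 8.0],
})
filled = fill_neighbor_mean(df, "type")
print(filled)
```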

Answer 1: (score: 2)

Just as you said:

 df[['v1','v2']] = (df.groupby('type')[['v1','v2']]
                      .agg(['bfill','ffill'])
                      .groupby(level=0, axis=1)
                      .mean()
                   )
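Note that groupby(..., axis=1) is deprecated in recent pandas releases (since 2.1). A sketch of the same idea without it, assuming a made-up miniature frame: stack the per-group bfill and ffill results with pd.concat and average the two rows that share each original index label. Because mean skips NaN by default, a group's leading or trailing NaN simply takes the one neighbor that exists, so no separate edge fill is needed.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame (not the question's data).
df = pd.DataFrame({
    "type": ["a"] * 3 + ["b"] * 3,
    "v1": [np.nan, 4.0, np.nan, 10.0, np.nan, 30.0],
    "v2": [2.0, np.nan, 8.0, np.nan, np.nan, 5.0],
})
cols = ["v1", "v2"]

g = df.groupby("type")[cols]
# Each original row label appears twice: once back-filled, once forward-filled.
stacked = pd.concat([g.bfill(), g.ffill()])
# Average the two versions per row label; NaN entries are skipped by mean().
df[cols] = stacked.groupby(level=0).mean()

print(df)
```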