Fill missing dates for each group and impute empty values in Pandas

Time: 2019-12-25 10:39:57

Tags: python-3.x pandas dataframe datetime

For the dataframe below, how can I fill in the missing dates of each city/district group, assuming the full date range runs from 2019/1/1 to 2019/6/1, and then fill each empty value cell with the mean of the values before and after it, falling back to bfill or ffill when there is no value on one side?

   city district      date  value
0     a        d  2019/1/1   9.99
1     a        d  2019/2/1  10.66
2     a        d  2019/3/1  10.56
3     a        d  2019/4/1  10.06
4     a        d  2019/5/1  10.69
5     a        d  2019/6/1  10.77
6     b        e  2019/1/1   9.72
7     b        e  2019/2/1   9.72
8     b        e  2019/4/1   9.78
9     b        e  2019/5/1   9.76
10    b        e  2019/6/1   9.66
11    c        f  2019/4/1   9.57
12    c        f  2019/5/1   9.47
13    c        f  2019/6/1   9.39
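
For reference, the sample frame above can be rebuilt with something like this (a minimal sketch; the name df and the column order simply mirror the table):

import pandas as pd

# Rebuild the sample data shown above; dates are kept as strings for now.
df = pd.DataFrame({
    'city':     list('aaaaaa') + list('bbbbb') + list('ccc'),
    'district': list('dddddd') + list('eeeee') + list('fff'),
    'date':     ['2019/1/1', '2019/2/1', '2019/3/1', '2019/4/1', '2019/5/1', '2019/6/1',
                 '2019/1/1', '2019/2/1', '2019/4/1', '2019/5/1', '2019/6/1',
                 '2019/4/1', '2019/5/1', '2019/6/1'],
    'value':    [9.99, 10.66, 10.56, 10.06, 10.69, 10.77,
                 9.72, 9.72, 9.78, 9.76, 9.66,
                 9.57, 9.47, 9.39],
})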

The expected result is:

   city district      date  value
0     a        d  2019/1/1   9.99
1     a        d  2019/2/1  10.66
2     a        d  2019/3/1  10.56
3     a        d  2019/4/1  10.06
4     a        d  2019/5/1  10.69
5     a        d  2019/6/1  10.77
6     b        e  2019/1/1   9.72
7     b        e  2019/2/1   9.72
8     b        e  2019/3/1   9.75
9     b        e  2019/4/1   9.78
10    b        e  2019/5/1   9.76
11    b        e  2019/6/1   9.66
12    c        f  2019/1/1   9.57
13    c        f  2019/2/1   9.57
14    c        f  2019/3/1   9.57
15    c        f  2019/4/1   9.57
16    c        f  2019/5/1   9.47
17    c        f  2019/6/1   9.39

How can this be done in pandas? Thanks very much.

Update: when I add freq='M', everything becomes NaN:

df['date']=pd.to_datetime(df['date'])
( df.set_index('date')
  .groupby(['city','district'],as_index=False)
  .apply(lambda x: x.reindex(pd.date_range(df.date.min(),df.date.max(), freq = 'M'))
                    .interpolate()
                    .bfill()
                    .ffill())
  .rename_axis(index = [0,'date'])
  .reset_index()
  .drop(0,axis=1)
)

Output:

         date  city  district  value
0  2019-01-31   NaN       NaN    NaN
1  2019-02-28   NaN       NaN    NaN
2  2019-03-31   NaN       NaN    NaN
3  2019-04-30   NaN       NaN    NaN
4  2019-05-31   NaN       NaN    NaN
5  2019-01-31   NaN       NaN    NaN
6  2019-02-28   NaN       NaN    NaN
7  2019-03-31   NaN       NaN    NaN
8  2019-04-30   NaN       NaN    NaN
9  2019-05-31   NaN       NaN    NaN
10 2019-01-31   NaN       NaN    NaN
11 2019-02-28   NaN       NaN    NaN
12 2019-03-31   NaN       NaN    NaN
13 2019-04-30   NaN       NaN    NaN
14 2019-05-31   NaN       NaN    NaN
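
A likely cause (an editorial sketch, not from the original post): freq='M' generates month-end timestamps, while the parsed dates are month starts, so none of the reindexed labels match an existing row and every column, including city and district, comes back NaN with nothing left for interpolate/bfill/ffill to work from. freq='MS' produces month starts that line up with the data:

import pandas as pd

# 'M' is a month-end frequency, 'MS' is month-start; only the latter matches dates like 2019-01-01.
print(pd.date_range('2019-01-01', '2019-06-01', freq='M'))    # 2019-01-31 ... 2019-05-31 (5 dates)
print(pd.date_range('2019-01-01', '2019-06-01', freq='MS'))   # 2019-01-01 ... 2019-06-01 (6 dates)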

3 Answers:

Answer 0 (score: 2):

We can do it like this:

df['date'] = pd.to_datetime(df['date'], format='%Y/%d/%m')

( df.set_index('date')
  .groupby(['city','district'],as_index=False)
  .apply(lambda x: x.reindex(pd.date_range(df.date.min(),df.date.max()))
                    .interpolate()
                    .bfill()
                    .ffill())
  .rename_axis(index = [0,'date'])
  .reset_index()
  .drop(0,axis=1)

)

Output:

         date city district  value
0  2019-01-01    a        d   9.99
1  2019-01-02    a        d  10.66
2  2019-01-03    a        d  10.56
3  2019-01-04    a        d  10.06
4  2019-01-05    a        d  10.69
5  2019-01-06    a        d  10.77
6  2019-01-01    b        e   9.72
7  2019-01-02    b        e   9.72
8  2019-01-03    b        e   9.75
9  2019-01-04    b        e   9.78
10 2019-01-05    b        e   9.76
11 2019-01-06    b        e   9.66
12 2019-01-01    c        f   9.57
13 2019-01-02    c        f   9.57
14 2019-01-03    c        f   9.57
15 2019-01-04    c        f   9.57
16 2019-01-05    c        f   9.47
17 2019-01-06    c        f   9.39
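
Note (an editorial sketch, not part of the original answer): with the format string above, day and month are swapped, so 2019/6/1 is parsed as 2019-01-06 and the daily date_range collapses the whole range into 2019-01-01 through 2019-01-06. Parsing in year/month/day order and reindexing with freq='MS', as the next answer does, keeps the month-start dates from the question:

import pandas as pd

pd.to_datetime('2019/6/1', format='%Y/%d/%m')   # Timestamp('2019-01-06 00:00:00') - month/day swapped
pd.to_datetime('2019/6/1', format='%Y/%m/%d')   # Timestamp('2019-06-01 00:00:00') - June 1st, as intended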

Answer 1 (score: 1):

You can change your solution to replace missing values per group, which avoids wrong replacement when a group contains only NaN values:

df['date']=pd.to_datetime(df['date'])

rng = pd.date_range('2019-01-01', '2019-06-01', freq='MS')
c = df['city'].unique()
mux = pd.MultiIndex.from_product([c, rng], names=['city', 'date'])

df1 = (df.set_index(['city', 'date']).reindex(mux, method='ffill')
       .groupby(level=0)
       .apply(lambda x: x.bfill().ffill())
       .reset_index())
print (df1)
   city       date district  value
0     a 2019-01-01        d   9.99
1     a 2019-02-01        d  10.66
2     a 2019-03-01        d  10.56
3     a 2019-04-01        d  10.06
4     a 2019-05-01        d  10.69
5     a 2019-06-01        d  10.77
6     b 2019-01-01        e   9.72
7     b 2019-02-01        e   9.72
8     b 2019-03-01        e   9.72
9     b 2019-04-01        e   9.78
10    b 2019-05-01        e   9.76
11    b 2019-06-01        e   9.66
12    c 2019-01-01        e   9.66
13    c 2019-02-01        e   9.66
14    c 2019-03-01        e   9.66
15    c 2019-04-01        f   9.57
16    c 2019-05-01        f   9.47
17    c 2019-06-01        f   9.39
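
A note on the block above (an editorial sketch, not part of the original answer): MultiIndex.from_product builds the full city × month grid, and reindex(mux, method='ffill') fills each brand-new label from the nearest existing label that sorts before it, with no notion of groups; that is why the early months of city c above inherit district e and 9.66 from the last row of city b. A minimal illustration of that fill rule:

import pandas as pd

# method='ffill' on reindex pulls the value of the closest existing label *before* the new one.
s = pd.Series([1.0, 2.0], index=[10, 30])
print(s.reindex([10, 20, 30], method='ffill'))   # label 20 is filled with 1.0, the value at label 10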

Or use a custom function with reindex and method='bfill' (note that backfilling gives the missing March row of city b the next value, 9.78, rather than the interpolated 9.75):

df2 = (df.set_index('date')
         .groupby(['city','district'], group_keys=False)
         .apply(lambda x: x.reindex(pd.date_range(df.date.min(),df.date.max(), freq='MS'), method='bfill')
                           .ffill())
         .rename_axis('date')
         .reset_index())
print (df2)
         date city district  value
0  2019-01-01    a        d   9.99
1  2019-02-01    a        d  10.66
2  2019-03-01    a        d  10.56
3  2019-04-01    a        d  10.06
4  2019-05-01    a        d  10.69
5  2019-06-01    a        d  10.77
6  2019-01-01    b        e   9.72
7  2019-02-01    b        e   9.72
8  2019-03-01    b        e   9.78
9  2019-04-01    b        e   9.78
10 2019-05-01    b        e   9.76
11 2019-06-01    b        e   9.66
12 2019-01-01    c        f   9.57
13 2019-02-01    c        f   9.57
14 2019-03-01    c        f   9.57
15 2019-04-01    c        f   9.57
16 2019-05-01    c        f   9.47
17 2019-06-01    c        f   9.39 

A solution with interpolate:

df2 = (df.set_index('date')
         .groupby(['city','district'], group_keys=False)
         .apply(lambda x: x.reindex(pd.date_range(df.date.min(),df.date.max(), freq='MS'))
                           .interpolate()
                           .bfill()
                           .ffill())
         .rename_axis('date')
         .reset_index())
print (df2)
         date city district  value
0  2019-01-01    a        d   9.99
1  2019-02-01    a        d  10.66
2  2019-03-01    a        d  10.56
3  2019-04-01    a        d  10.06
4  2019-05-01    a        d  10.69
5  2019-06-01    a        d  10.77
6  2019-01-01    b        e   9.72
7  2019-02-01    b        e   9.72
8  2019-03-01    b        e   9.75
9  2019-04-01    b        e   9.78
10 2019-05-01    b        e   9.76
11 2019-06-01    b        e   9.66
12 2019-01-01    c        f   9.57
13 2019-02-01    c        f   9.57
14 2019-03-01    c        f   9.57
15 2019-04-01    c        f   9.57
16 2019-05-01    c        f   9.47
17 2019-06-01    c        f   9.39

EDIT1: A solution for only one column:

df2 = (df.set_index('date')
         .groupby(['city','district'])['value']
         .apply(lambda x: x.reindex(pd.date_range(df.date.min(),df.date.max(), freq='MS'))
                           .interpolate()
                           .bfill()
                           .ffill())
         .rename_axis(['city','district','date'])
         .reset_index())
print (df2)
   city district       date  value
0     a        d 2019-01-01   9.99
1     a        d 2019-02-01  10.66
2     a        d 2019-03-01  10.56
3     a        d 2019-04-01  10.06
4     a        d 2019-05-01  10.69
5     a        d 2019-06-01  10.77
6     b        e 2019-01-01   9.72
7     b        e 2019-02-01   9.72
8     b        e 2019-03-01   9.75
9     b        e 2019-04-01   9.78
10    b        e 2019-05-01   9.76
11    b        e 2019-06-01   9.66
12    c        f 2019-01-01   9.57
13    c        f 2019-02-01   9.57
14    c        f 2019-03-01   9.57
15    c        f 2019-04-01   9.57
16    c        f 2019-05-01   9.47
17    c        f 2019-06-01   9.39  

Answer 2 (score: 0):

This solution:

df['date']=pd.to_datetime(df['date'])

rng = pd.date_range('2019-01-01', '2019-06-01', freq='MS')
c = df['city'].unique()
mux = pd.MultiIndex.from_product([c, rng], names=['city', 'date'])


print(df.set_index(['city', 'date']).reindex(mux).groupby(level=0)\
        .bfill()\
        .ffill()\
        .reset_index())

Output:

   city       date district  value
0     a 2019-01-01        d   9.99
1     a 2019-02-01        d  10.66
2     a 2019-03-01        d  10.56
3     a 2019-04-01        d  10.06
4     a 2019-05-01        d  10.69
5     a 2019-06-01        d  10.77
6     b 2019-01-01        e   9.72
7     b 2019-02-01        e   9.72
8     b 2019-03-01        e   9.78
9     b 2019-04-01        e   9.78
10    b 2019-05-01        e   9.76
11    b 2019-06-01        e   9.66
12    c 2019-01-01        f   9.57
13    c 2019-02-01        f   9.57
14    c 2019-03-01        f   9.57
15    c 2019-04-01        f   9.57
16    c 2019-05-01        f   9.47
17    c 2019-06-01        f   9.39