For the dataframe below, how can I fill in the missing dates for each city and district group, assuming the full date range runs from 2019/1/1 to 2019/6/1, and then fill each empty value cell with the mean of the values before and after it, falling back to bfill or ffill when there is no value on one side?
city district date value
0 a d 2019/1/1 9.99
1 a d 2019/2/1 10.66
2 a d 2019/3/1 10.56
3 a d 2019/4/1 10.06
4 a d 2019/5/1 10.69
5 a d 2019/6/1 10.77
6 b e 2019/1/1 9.72
7 b e 2019/2/1 9.72
8 b e 2019/4/1 9.78
9 b e 2019/5/1 9.76
10 b e 2019/6/1 9.66
11 c f 2019/4/1 9.57
12 c f 2019/5/1 9.47
13 c f 2019/6/1 9.39
The expected result is as follows:
city district date value
0 a d 2019/1/1 9.99
1 a d 2019/2/1 10.66
2 a d 2019/3/1 10.56
3 a d 2019/4/1 10.06
4 a d 2019/5/1 10.69
5 a d 2019/6/1 10.77
6 b e 2019/1/1 9.72
7 b e 2019/2/1 9.72
8 b e 2019/3/1 9.75
9 b e 2019/4/1 9.78
10 b e 2019/5/1 9.76
11 b e 2019/6/1 9.66
12 c f 2019/1/1 9.57
13 c f 2019/2/1 9.57
14 c f 2019/3/1 9.57
15 c f 2019/4/1 9.57
16 c f 2019/5/1 9.47
17 c f 2019/6/1 9.39
How can I do this in pandas? Many thanks.
Update:
When I add freq='M', everything becomes NaN:
df['date']=pd.to_datetime(df['date'])
( df.set_index('date')
.groupby(['city','district'],as_index=False)
.apply(lambda x: x.reindex(pd.date_range(df.date.min(),df.date.max(), freq = 'M'))
.interpolate()
.bfill()
.ffill())
.rename_axis(index = [0,'date'])
.reset_index()
.drop(0,axis=1)
)
Output:
date city district value
0 2019-01-31 NaN NaN NaN
1 2019-02-28 NaN NaN NaN
2 2019-03-31 NaN NaN NaN
3 2019-04-30 NaN NaN NaN
4 2019-05-31 NaN NaN NaN
5 2019-01-31 NaN NaN NaN
6 2019-02-28 NaN NaN NaN
7 2019-03-31 NaN NaN NaN
8 2019-04-30 NaN NaN NaN
9 2019-05-31 NaN NaN NaN
10 2019-01-31 NaN NaN NaN
11 2019-02-28 NaN NaN NaN
12 2019-03-31 NaN NaN NaN
13 2019-04-30 NaN NaN NaN
14 2019-05-31 NaN NaN NaN
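The all-NaN result in the update comes from the frequency alias: 'M' generates month-end timestamps, which never coincide with the month-start dates in the data, so reindex aligns nothing. A quick way to see the difference:

```python
import pandas as pd

# 'M' yields month-END stamps, none of which exist in the data,
# so reindex matches nothing and every column becomes NaN.
print(pd.date_range('2019-01-01', '2019-06-01', freq='M'))

# 'MS' (month start) yields the stamps the data actually contains.
print(pd.date_range('2019-01-01', '2019-06-01', freq='MS'))
```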
Answer 0 (score: 2)
We can do it with a month-start (freq='MS') reindex per group, so the generated dates match the dates in the data:
df['date'] = pd.to_datetime(df['date'], format='%Y/%m/%d')
( df.set_index('date')
    .groupby(['city','district'],as_index=False)
    .apply(lambda x: x.reindex(pd.date_range(df.date.min(),df.date.max(), freq='MS'))
    .interpolate()
    .bfill()
    .ffill())
    .rename_axis(index = [0,'date'])
    .reset_index()
    .drop(0,axis=1)
)
Output:
date city district value
0 2019-01-01 a d 9.99
1 2019-02-01 a d 10.66
2 2019-03-01 a d 10.56
3 2019-04-01 a d 10.06
4 2019-05-01 a d 10.69
5 2019-06-01 a d 10.77
6 2019-01-01 b e 9.72
7 2019-02-01 b e 9.72
8 2019-03-01 b e 9.75
9 2019-04-01 b e 9.78
10 2019-05-01 b e 9.76
11 2019-06-01 b e 9.66
12 2019-01-01 c f 9.57
13 2019-02-01 c f 9.57
14 2019-03-01 c f 9.57
15 2019-04-01 c f 9.57
16 2019-05-01 c f 9.47
17 2019-06-01 c f 9.39
Answer 1 (score: 1)
You can change your solution to replace missing values per group, which avoids filling a group's values from a different group even when a group contains only NaN values:
df['date']=pd.to_datetime(df['date'])
rng = pd.date_range('2019-01-01', '2019-06-01', freq='MS')
c = df['city'].unique()
mux = pd.MultiIndex.from_product([c, rng], names=['city', 'date'])
df1 = (df.set_index(['city', 'date']).reindex(mux)
.groupby(level=0)
.apply(lambda x: x.bfill().ffill())
.reset_index())
print (df1)
city date district value
0 a 2019-01-01 d 9.99
1 a 2019-02-01 d 10.66
2 a 2019-03-01 d 10.56
3 a 2019-04-01 d 10.06
4 a 2019-05-01 d 10.69
5 a 2019-06-01 d 10.77
6 b 2019-01-01 e 9.72
7 b 2019-02-01 e 9.72
8 b 2019-03-01 e 9.78
9 b 2019-04-01 e 9.78
10 b 2019-05-01 e 9.76
11 b 2019-06-01 e 9.66
12 c 2019-01-01 f 9.57
13 c 2019-02-01 f 9.57
14 c 2019-03-01 f 9.57
15 c 2019-04-01 f 9.57
16 c 2019-05-01 f 9.47
17 c 2019-06-01 f 9.39
Or use a custom function with reindex and method='bfill':
df2 = (df.set_index('date')
.groupby(['city','district'], group_keys=False)
.apply(lambda x: x.reindex(pd.date_range(df.date.min(),df.date.max(), freq='MS'), method='bfill')
.ffill())
.rename_axis('date')
.reset_index())
print (df2)
date city district value
0 2019-01-01 a d 9.99
1 2019-02-01 a d 10.66
2 2019-03-01 a d 10.56
3 2019-04-01 a d 10.06
4 2019-05-01 a d 10.69
5 2019-06-01 a d 10.77
6 2019-01-01 b e 9.72
7 2019-02-01 b e 9.72
8 2019-03-01 b e 9.78
9 2019-04-01 b e 9.78
10 2019-05-01 b e 9.76
11 2019-06-01 b e 9.66
12 2019-01-01 c f 9.57
13 2019-02-01 c f 9.57
14 2019-03-01 c f 9.57
15 2019-04-01 c f 9.57
16 2019-05-01 c f 9.47
17 2019-06-01 c f 9.39
Solution with interpolate:
df2 = (df.set_index('date')
.groupby(['city','district'], group_keys=False)
.apply(lambda x: x.reindex(pd.date_range(df.date.min(),df.date.max(), freq='MS'))
.interpolate()
.bfill()
.ffill())
.rename_axis('date')
.reset_index())
print (df2)
date city district value
0 2019-01-01 a d 9.99
1 2019-02-01 a d 10.66
2 2019-03-01 a d 10.56
3 2019-04-01 a d 10.06
4 2019-05-01 a d 10.69
5 2019-06-01 a d 10.77
6 2019-01-01 b e 9.72
7 2019-02-01 b e 9.72
8 2019-03-01 b e 9.75
9 2019-04-01 b e 9.78
10 2019-05-01 b e 9.76
11 2019-06-01 b e 9.66
12 2019-01-01 c f 9.57
13 2019-02-01 c f 9.57
14 2019-03-01 c f 9.57
15 2019-04-01 c f 9.57
16 2019-05-01 c f 9.47
17 2019-06-01 c f 9.39
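The interpolate variant matches the question's requirement literally: linear interpolation of a single interior gap is exactly the mean of its two neighbours. A minimal check, using the values around group b's missing month:

```python
import pandas as pd
import numpy as np

s = pd.Series([9.72, np.nan, 9.78])
filled = s.interpolate()
# The interior NaN becomes (9.72 + 9.78) / 2 = 9.75,
# i.e. the mean of the values before and after it.
print(filled)
```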
EDIT1: Solution for one column only:
df2 = (df.set_index('date')
.groupby(['city','district'])['value']
.apply(lambda x: x.reindex(pd.date_range(df.date.min(),df.date.max(), freq='MS'))
.interpolate()
.bfill()
.ffill())
.rename_axis(['city','district','date'])
.reset_index())
print (df2)
city district date value
0 a d 2019-01-01 9.99
1 a d 2019-02-01 10.66
2 a d 2019-03-01 10.56
3 a d 2019-04-01 10.06
4 a d 2019-05-01 10.69
5 a d 2019-06-01 10.77
6 b e 2019-01-01 9.72
7 b e 2019-02-01 9.72
8 b e 2019-03-01 9.75
9 b e 2019-04-01 9.78
10 b e 2019-05-01 9.76
11 b e 2019-06-01 9.66
12 c f 2019-01-01 9.57
13 c f 2019-02-01 9.57
14 c f 2019-03-01 9.57
15 c f 2019-04-01 9.57
16 c f 2019-05-01 9.47
17 c f 2019-06-01 9.39
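Note why the interpolate variants above still chain bfill()/ffill(): linear interpolation only fills gaps that have a valid value on the earlier side, so leading NaN (as in group c, whose first observation is 2019-04-01) remain untouched. A small illustration:

```python
import pandas as pd
import numpy as np

s = pd.Series([np.nan, np.nan, 9.57, 9.47])
# interpolate() leaves the two leading NaN in place...
print(s.interpolate())
# ...so bfill() is still needed to fill them with 9.57.
print(s.interpolate().bfill())
```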
Answer 2 (score: 0)
This solution:
df['date']=pd.to_datetime(df['date'])
rng = pd.date_range('2019-01-01', '2019-06-01', freq='MS')
c = df['city'].unique()
mux = pd.MultiIndex.from_product([c, rng], names=['city', 'date'])
print(df.set_index(['city', 'date']).reindex(mux).groupby(level=0)\
.bfill()\
.ffill()\
.reset_index())
Output:
city date district value
0 a 2019-01-01 d 9.99
1 a 2019-02-01 d 10.66
2 a 2019-03-01 d 10.56
3 a 2019-04-01 d 10.06
4 a 2019-05-01 d 10.69
5 a 2019-06-01 d 10.77
6 b 2019-01-01 e 9.72
7 b 2019-02-01 e 9.72
8 b 2019-03-01 e 9.78
9 b 2019-04-01 e 9.78
10 b 2019-05-01 e 9.76
11 b 2019-06-01 e 9.66
12 c 2019-01-01 f 9.57
13 c 2019-02-01 f 9.57
14 c 2019-03-01 f 9.57
15 c 2019-04-01 f 9.57
16 c 2019-05-01 f 9.47
17 c 2019-06-01 f 9.39
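The from_product index above is built from city alone, which works because each city in this data maps to exactly one district (the district column is then back/forward filled within the group). If a city could span several districts, one would build the full index from the observed (city, district) pairs instead. A sketch under that assumption, using a small hypothetical frame:

```python
import pandas as pd
import numpy as np

# Hypothetical sparse input with the same columns as the question's data.
df = pd.DataFrame({
    'city':     ['a', 'a', 'b', 'b', 'c'],
    'district': ['d', 'd', 'e', 'e', 'f'],
    'date': pd.to_datetime(['2019-01-01', '2019-02-01',
                            '2019-01-01', '2019-04-01', '2019-04-01']),
    'value': [9.99, 10.66, 9.72, 9.78, 9.57],
})

rng = pd.date_range('2019-01-01', '2019-06-01', freq='MS')
# Cross every observed (city, district) pair with the full date range,
# instead of crossing city alone with the dates.
pairs = df[['city', 'district']].drop_duplicates()
full = pairs.merge(pd.DataFrame({'date': rng}), how='cross')
mux = pd.MultiIndex.from_frame(full)

out = (df.set_index(['city', 'district', 'date'])
         .reindex(mux)
         .groupby(level=[0, 1], group_keys=False)
         .apply(lambda g: g.interpolate().bfill().ffill())
         .reset_index())
print(out)
```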