I have the following DataFrame of business-day data:
In [23]: d = pd.DataFrame({'date' : ['20070105', '20070105', '20070106', '20070106', '20070106', '20070109'], 's' : [1, 2, 1,2,3,1], 'i': ['a', 'b', 'a', 'b', 'c', 'a']})
In [26]: d['date'] = pd.to_datetime(d['date'], format='%Y%m%d')
In [27]: d
Out[27]:
date i s
0 2007-01-05 a 1
1 2007-01-05 b 2
2 2007-01-06 a 1
3 2007-01-06 b 2
4 2007-01-06 c 3
5 2007-01-09 a 1
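I want to fill in the data for the missing dates (according to an 'alldays' calendar), and the output should look like the following. Basically, 20070107 and 20070108 are missing, and their data is copied from 20070106.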
Out[31]:
date i s
0 2007-01-05 a 1
1 2007-01-05 b 2
2 2007-01-06 a 1
3 2007-01-06 b 2
4 2007-01-06 c 3
5 2007-01-07 a 1
6 2007-01-07 b 2
7 2007-01-07 c 3
8 2007-01-08 a 1
9 2007-01-08 b 2
10 2007-01-08 c 3
11 2007-01-09 a 1
What is the best way to do this in pandas?
Answer 0 (score: 2)
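Use:
set_index with cumcount to get a unique MultiIndex
unstack to reshape, which gives a unique DatetimeIndex
reindex over the date range from the minimum to the maximum date, with method='ffill'
stack to return to the long format
reset_index(level=1, drop=True) to remove the second level of the MultiIndex
rename_axis and reset_index to convert the DatetimeIndex to a column

df = (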
d.set_index(['date', d.groupby('date').cumcount()])
.unstack()
.reindex(pd.date_range(d['date'].min(), d['date'].max()), method='ffill')
.stack()
.reset_index(level=1, drop=True)
.rename_axis('date')
.reset_index()
)
print (df)
date i s
0 2007-01-05 a 1.0
1 2007-01-05 b 2.0
2 2007-01-06 a 1.0
3 2007-01-06 b 2.0
4 2007-01-06 c 3.0
5 2007-01-07 a 1.0
6 2007-01-07 b 2.0
7 2007-01-07 c 3.0
8 2007-01-08 a 1.0
9 2007-01-08 b 2.0
10 2007-01-08 c 3.0
11 2007-01-09 a 1.0
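Note that in the output above `s` comes back as float (1.0, 2.0, ...), because the unstack step introduces NaN for dates with fewer rows. If the integer dtype matters, a different technique may help: map every calendar day to the most recent date that has data, then merge. The following is only a sketch of that alternative, not the method used above; the names `all_days`, `observed`, `source`, `lookup`, `source_date` and `filled` are made up for illustration.

```python
import pandas as pd

d = pd.DataFrame({
    'date': ['20070105', '20070105', '20070106', '20070106', '20070106', '20070109'],
    's': [1, 2, 1, 2, 3, 1],
    'i': ['a', 'b', 'a', 'b', 'c', 'a'],
})
d['date'] = pd.to_datetime(d['date'], format='%Y%m%d')

# Every calendar day between the first and last observed date.
all_days = pd.date_range(d['date'].min(), d['date'].max())

# Sorted unique dates that actually have data.
observed = pd.DatetimeIndex(d['date'].unique()).sort_values()

# For each calendar day, the most recent observed date (forward fill).
source = observed.to_series().reindex(all_days, method='ffill')

# Hypothetical helper table: calendar day -> date whose rows should be copied.
lookup = pd.DataFrame({'date': all_days, 'source_date': source.to_numpy()})

# Pull in all rows of the source date for each calendar day.
filled = (
    lookup.merge(d.rename(columns={'date': 'source_date'}),
                 on='source_date', how='left')
          .drop(columns='source_date')
)
print(filled)
```

Because every calendar day maps to a date that exists in `d`, the merge should not introduce missing values, so `s` is expected to keep its integer dtype. The column order follows `d` rather than the alphabetical order shown above.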