I have a DataFrame like this:
df = pd.DataFrame({"ID":["A", "A", "A" ,"B", "B"], "date":["06/24/2014","06/26/2014","06/29/2014","07/02/1999","07/04/1999"], "value": ["4","6","9","2","4"] })
  ID        date value
0  A  06/24/2014     4
1  A  06/26/2014     6
2  A  06/29/2014     9
3  B  07/02/1999     2
4  B  07/04/1999     4
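For each group, I want to expand the DataFrame to include every missing date between that group's minimum and maximum date, and then linearly interpolate the value column. The result should look like this: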
  ID        date value
0  A  06/24/2014     4
1  A  06/25/2014     5
2  A  06/26/2014     6
3  A  06/27/2014     7
4  A  06/28/2014     8
5  A  06/29/2014     9
6  B  07/02/1999     2
7  B  07/03/1999     3
8  B  07/04/1999     4
Here is my idea so far:
Convert the date column to datetimes:
df.date = pd.DatetimeIndex(df.date)
Group by ID and apply the following function:
B = df.groupby('ID').apply(lambda x: x.reindex(pd.date_range(x.date.min(), x.date.max()), fill_value=0))
What is the best way to do this?
Thanks,
Answer 0 (score: 4)
I had to do some initial conditioning to make sure the dtypes are correct:
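import pandas as pd

df = pd.DataFrame({"ID": ["A", "A", "A", "B", "B"],
                   "date": ["06/24/2014", "06/26/2014", "06/29/2014", "07/02/1999", "07/04/1999"],
                   "value": ["4", "6", "9", "2", "4"]})
# Parse the date strings and turn the string values into numbers
df.date = pd.to_datetime(df.date)
df.value = pd.to_numeric(df.value, errors='coerce')
df = df.set_index('date')
# Per ID: upsample to daily frequency, then interpolate linearly.
# Note: with group_keys=False the ID label is not added to the result.
df.groupby('ID', group_keys=False).value \
    .apply(lambda s: s.resample('D').interpolate()).reset_index()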
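For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same resample-then-interpolate idea. It assumes you want ID back as a column in the result, so it leaves group_keys at True instead of False; the helper name fill_daily and the variable result are just illustrative names, not anything from pandas.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["A", "A", "A", "B", "B"],
    "date": ["06/24/2014", "06/26/2014", "06/29/2014", "07/02/1999", "07/04/1999"],
    "value": ["4", "6", "9", "2", "4"],
})

# Fix the dtypes first: real datetimes and real numbers.
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y")
df["value"] = pd.to_numeric(df["value"], errors="coerce")


def fill_daily(s):
    """Upsample one group's date-indexed values to daily rows and interpolate linearly."""
    return s.resample("D").interpolate()


result = (
    df.set_index("date")
      .groupby("ID", group_keys=True)["value"]  # keep the ID label in the index
      .apply(fill_daily)
      .reset_index()                            # index levels (ID, date) become columns
)
print(result)
```

The value column comes back as floats, and the dates are datetimes rather than the original MM/DD/YYYY strings; something like result["date"].dt.strftime("%m/%d/%Y") should convert them back if the string form matters.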
Answer 1 (score: 0)
I would do it like this:
In [6]: df.groupby('ID').apply(lambda x: x.set_index('date').resample('D').pad())
Out[6]:
              ID value
ID date
A  2014-06-24  A     4
   2014-06-25  A     4
   2014-06-26  A     6
   2014-06-27  A     6
   2014-06-28  A     6
   2014-06-29  A     9
B  1999-07-02  B     2
   1999-07-03  B     2
   1999-07-04  B     4
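Note that pad() forward-fills, so the output above repeats the last known value (4, 4, 6, 6, 6, 9) rather than interpolating. If linear interpolation is what you need, one possible variant is sketched below. It swaps the groupby/apply for an explicit loop and uses asfreq() to insert the missing days; the names pieces, daily and filled are only illustrative, and it assumes value has already been converted to a numeric dtype.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["A", "A", "A", "B", "B"],
    "date": ["06/24/2014", "06/26/2014", "06/29/2014", "07/02/1999", "07/04/1999"],
    "value": ["4", "6", "9", "2", "4"],
})
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y")
df["value"] = pd.to_numeric(df["value"])

pieces = []
for key, group in df.groupby("ID"):
    # Insert one row per calendar day between the group's first and last date;
    # the newly created rows hold NaN in every column.
    daily = group.set_index("date").resample("D").asfreq()
    daily["ID"] = key                                # restore the group label
    daily["value"] = daily["value"].interpolate()    # linear by default
    pieces.append(daily)

# Stack the groups and restore the original column order.
filled = pd.concat(pieces).reset_index()[["ID", "date", "value"]]
print(filled)
```

Interpolating only the value column sidesteps the question of what interpolating the non-numeric ID column would mean.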