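I have a dataframe df containing user IDs, observation dates (usually at quarterly frequency, but possibly irregular) and characteristic values, for example: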
import pandas as pd
df = pd.DataFrame(dict(
    UserID=[11590] * 3 + [115948] * 4,
    Date=pd.to_datetime(['20050613', '20050905', '20051214',
                         '20040924', '20041101', '20050202', '20050516'],
                        format='%Y%m%d'),
    Characteristic=[0.06, 0.09, 0.07, 0.13, 0.09, 0.06, 0.04]))
UserID Date Characteristic
115950 6/13/2005 0.06
115950 9/5/2005 0.09
115950 12/14/2005 0.07
115948 9/24/2004 0.13
115948 11/1/2004 0.09
115948 2/2/2005 0.06
115948 5/16/2005 0.04
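I am trying to upsample it to monthly frequency within each user ID group. That is, I would like to get something like this (sorted by user ID and date):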
UserID Date Characteristic month_date
115950 6/13/2005 0.06 6/30/2005
115950 6/13/2005 0.06 7/31/2005
115950 6/13/2005 0.06 8/31/2005
115950 9/5/2005 0.09 9/30/2005
115950 9/5/2005 0.09 10/31/2005
115950 9/5/2005 0.09 11/30/2005
115950 12/14/2005 0.07 12/31/2005
115950 12/14/2005 0.07 1/31/2006
115950 12/14/2005 0.07 2/28/2006
115948 9/24/2004 0.13 9/30/2004
115948 9/24/2004 0.13 10/31/2004
115948 11/1/2004 0.09 11/30/2004
115948 11/1/2004 0.09 12/31/2004
115948 11/1/2004 0.09 1/31/2005
115948 2/2/2005 0.06 2/28/2005
115948 2/2/2005 0.06 3/31/2005
115948 2/2/2005 0.06 4/30/2005
115948 5/16/2005 0.04 5/31/2005
115948 5/16/2005 0.04 6/30/2005
115948 5/16/2005 0.04 7/31/2005
Note that the record 115948 9/24/2004 0.13 is upsampled only twice, because the next available date is 11/1/2004, which produces a month_date of 11/30/2004 in the upsampled set.
I tried applying resample to the grouped dataframe:
newdf=df.groupby(['UserID']).resample("M",fill_method='ffill')
But this does not produce the expected result. Any guidance or suggestions would be greatly appreciated.
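One likely contributor, assuming a recent pandas release: resample needs a datetime-like index (or a column named through on=), and the fill_method keyword is no longer accepted. The minimal sketch below uses only the three rows of the first user; the variable names are illustrative and not part of the original question.

```python
import pandas as pd

# Rows of a single user, taken from the question's sample data.
one_user = pd.DataFrame({
    "Date": pd.to_datetime(["2005-06-13", "2005-09-05", "2005-12-14"]),
    "Characteristic": [0.06, 0.09, 0.07],
})

month_end = pd.offsets.MonthEnd()

# With the default integer index there is nothing date-like to bin on.
try:
    one_user.resample(month_end).first()
    error = None
except TypeError as exc:
    error = exc
print("resample on an integer index failed:", error is not None)

# Moving the dates into the index makes the call valid; first() keeps the
# observation in each month and ffill() carries it into the empty months.
monthly = one_user.set_index("Date").resample(month_end).first().ffill()
print(monthly)
```

The same pattern has to be applied per user, otherwise the months of different users end up in one shared calendar.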
Answer 0 (score: 1)
You can use resample together with reset_index:
import pandas as pd
df_dg = pd.DataFrame(dict(
UserID=[11590] * 3 + [115948] * 4,
Date=[20050613, 20050905, 20051214,
20040924, 20041101, 20050202,20050516],
Characteristic=[0.06, 0.09, 0.07, 0.13, 0.09, 0.06, 0.04]), columns=['UserID','Date','Characteristic'])
df_dg['Date'] = pd.to_datetime(df_dg['Date'], format="%Y%m%d")
print(df_dg)
UserID Date Characteristic
0 11590 2005-06-13 0.06
1 11590 2005-09-05 0.09
2 11590 2005-12-14 0.07
3 115948 2004-09-24 0.13
4 115948 2004-11-01 0.09
5 115948 2005-02-02 0.06
6 115948 2005-05-16 0.04
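# Keep a copy of the observation date; 'Date' itself becomes the resampling index.
df_dg['Date1'] = df_dg['Date']
# Current pandas releases no longer accept how= / fill_method=, so chain .first() and .ffill().
newdf = (df_dg.groupby('UserID')[['Date', 'Date1', 'Characteristic']]
         .apply(lambda x: x.set_index('Date').resample(pd.offsets.MonthEnd()).first().ffill())
         .reset_index())
newdf = newdf.rename(columns={'Date':'month_date', 'Date1':'Date'})
newdf = newdf[['UserID','Date','Characteristic','month_date']]
print(newdf)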
UserID Date Characteristic month_date
0 11590 2005-06-13 0.06 2005-06-30
1 11590 2005-06-13 0.06 2005-07-31
2 11590 2005-06-13 0.06 2005-08-31
3 11590 2005-09-05 0.09 2005-09-30
4 11590 2005-09-05 0.09 2005-10-31
5 11590 2005-09-05 0.09 2005-11-30
6 11590 2005-12-14 0.07 2005-12-31
7 115948 2004-09-24 0.13 2004-09-30
8 115948 2004-09-24 0.13 2004-10-31
9 115948 2004-11-01 0.09 2004-11-30
10 115948 2004-11-01 0.09 2004-12-31
11 115948 2004-11-01 0.09 2005-01-31
12 115948 2005-02-02 0.06 2005-02-28
13 115948 2005-02-02 0.06 2005-03-31
14 115948 2005-02-02 0.06 2005-04-30
15 115948 2005-05-16 0.04 2005-05-31
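Note that this output stops at each user's last observed month, whereas the desired output in the question carries the last observation forward for two more month-ends. A possible sketch for that, assuming the two extra months are the intended rule (it is inferred from the question's expected table, not stated there); TRAILING_MONTHS, upsample_group and ObsDate are illustrative names:

```python
import pandas as pd

df = pd.DataFrame({
    "UserID": [11590] * 3 + [115948] * 4,
    "Date": pd.to_datetime(
        ["20050613", "20050905", "20051214",
         "20040924", "20041101", "20050202", "20050516"], format="%Y%m%d"),
    "Characteristic": [0.06, 0.09, 0.07, 0.13, 0.09, 0.06, 0.04],
})
df["ObsDate"] = df["Date"]  # copy that survives once 'Date' becomes the index

TRAILING_MONTHS = 2  # assumption: extra month-ends after the last observation
MONTH_END = pd.offsets.MonthEnd()


def upsample_group(group):
    """Monthly rows for one user, extended past the last observation."""
    # One row per month-end; months without an observation are NaN/NaT.
    monthly = group.set_index("Date").resample(MONTH_END).first()
    # Extend the calendar by TRAILING_MONTHS month-ends, then forward-fill.
    last = monthly.index[-1] + pd.offsets.MonthEnd(TRAILING_MONTHS)
    full_index = pd.date_range(monthly.index[0], last, freq=MONTH_END)
    return monthly.reindex(full_index).ffill()


extended = (
    df.groupby("UserID")[["Date", "ObsDate", "Characteristic"]]
      .apply(upsample_group)
      .rename_axis(["UserID", "month_date"])
      .reset_index()
      .rename(columns={"ObsDate": "Date"})
      [["UserID", "Date", "Characteristic", "month_date"]]
)
print(extended)
```

If two observations fall in the same month, first() keeps only the earlier one; swap in last() if the later one should win.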