Converting from irregular frequency to monthly within a groupby object in a pandas DataFrame

Time: 2016-02-05 17:32:46

Tags: python-2.7 pandas dataframe

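I have a DataFrame df containing a user ID, an observation date (usually at quarterly frequency, but possibly irregular), and a characteristic value, for example: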

import pandas as pd

# Observation dates are given as YYYYMMDD integers and parsed to timestamps.
dates = [20050613, 20050905, 20051214,
         20040924, 20041101, 20050202, 20050516]
df = pd.DataFrame(dict(
    UserID=[11590] * 3 + [115948] * 4,
    Date=pd.to_datetime([str(d) for d in dates], format="%Y%m%d"),
    Characteristic=[0.06, 0.09, 0.07, 0.13, 0.09, 0.06, 0.04]))

UserID     Date    Characteristic
115950  6/13/2005   0.06
115950  9/5/2005    0.09
115950  12/14/2005  0.07
115948  9/24/2004   0.13
115948  11/1/2004   0.09
115948  2/2/2005    0.06
115948  5/16/2005   0.04

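I am trying to upsample this to monthly frequency within each UserID group. That is, I would like to get something like this (sorted by user ID and date):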

UserID  Date    Characteristic  month_date
115950  6/13/2005   0.06    6/30/2005
115950  6/13/2005   0.06    7/31/2005
115950  6/13/2005   0.06    8/31/2005
115950  9/5/2005    0.09    9/30/2005
115950  9/5/2005    0.09    10/31/2005
115950  9/5/2005    0.09    11/30/2005
115950  12/14/2005  0.07    12/31/2005
115950  12/14/2005  0.07    1/31/2006
115950  12/14/2005  0.07    2/28/2006
115948  9/24/2004   0.13    9/30/2004
115948  9/24/2004   0.13    10/31/2004
115948  11/1/2004   0.09    11/30/2004
115948  11/1/2004   0.09    12/31/2004
115948  11/1/2004   0.09    1/31/2005
115948  2/2/2005    0.06    2/28/2005
115948  2/2/2005    0.06    3/31/2005
115948  2/2/2005    0.06    4/30/2005
115948  5/16/2005   0.04    5/31/2005
115948  5/16/2005   0.04    6/30/2005
115948  5/16/2005   0.04    7/31/2005

Note that the record 115948 9/24/2004 0.13 is upsampled only twice, because the next available date is 11/1/2004, which produces the month_date 11/30/2004 in the upsampled set.
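
To make the rule concrete, here is a small sketch that counts how many monthly rows each record of user 115948 should get. The cap of three months (`MAX_MONTHS`) is an assumption read off the expected output above rather than a stated requirement, and the column name `n_months` is only illustrative.

```python
import pandas as pd

obs = pd.DataFrame({
    "UserID": [115948] * 4,
    "Date": pd.to_datetime(["2004-09-24", "2004-11-01",
                            "2005-02-02", "2005-05-16"]),
})

MAX_MONTHS = 3  # assumption: taken from the expected output, not a given rule

# Calendar month number of each observation (year * 12 + month).
month_no = obs["Date"].dt.year * 12 + obs["Date"].dt.month
# Months until the same user's next observation; NaN for the last record.
gap = month_no.groupby(obs["UserID"]).shift(-1) - month_no
# Repeat each record until the next record's month, at most MAX_MONTHS times.
obs["n_months"] = gap.fillna(MAX_MONTHS).clip(upper=MAX_MONTHS).astype(int)
print(obs)
```

The 9/24/2004 record gets two rows (September and October) and every other record gets three.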

I tried applying resample to the grouped DataFrame:

newdf=df.groupby(['UserID']).resample("M",fill_method='ffill')

But this does not produce the expected result. Any guidance or suggestions would be much appreciated.

1 Answer:

Answer 0: (score: 1)

You can use resample together with reset_index:

import pandas as pd

df_dg = pd.DataFrame(dict(
     UserID=[11590] * 3 + [115948] * 4,
     Date=[20050613, 20050905, 20051214,
                    20040924, 20041101, 20050202,20050516],
      Characteristic=[0.06, 0.09, 0.07, 0.13, 0.09, 0.06, 0.04]), columns=['UserID','Date','Characteristic'])


df_dg['Date'] = pd.to_datetime(df_dg['Date'], format="%Y%m%d")
print df_dg
   UserID       Date  Characteristic
0   11590 2005-06-13            0.06
1   11590 2005-09-05            0.09
2   11590 2005-12-14            0.07
3  115948 2004-09-24            0.13
4  115948 2004-11-01            0.09
5  115948 2005-02-02            0.06
6  115948 2005-05-16            0.04


df_dg['Date1'] = df_dg['Date']

newdf = df_dg.groupby('UserID').apply(lambda x: x.set_index('Date').resample('M', how='first',fill_method='ffill')).reset_index(drop=True, level=0).reset_index()
newdf = newdf.rename(columns={'Date':'month_date', 'Date1':'Date'})
newdf = newdf[['UserID','Date','Characteristic','month_date']]
print newdf
    UserID       Date  Characteristic month_date
0    11590 2005-06-13            0.06 2005-06-30
1    11590 2005-06-13            0.06 2005-07-31
2    11590 2005-06-13            0.06 2005-08-31
3    11590 2005-09-05            0.09 2005-09-30
4    11590 2005-09-05            0.09 2005-10-31
5    11590 2005-09-05            0.09 2005-11-30
6    11590 2005-12-14            0.07 2005-12-31
7   115948 2004-09-24            0.13 2004-09-30
8   115948 2004-09-24            0.13 2004-10-31
9   115948 2004-11-01            0.09 2004-11-30
10  115948 2004-11-01            0.09 2004-12-31
11  115948 2004-11-01            0.09 2005-01-31
12  115948 2005-02-02            0.06 2005-02-28
13  115948 2005-02-02            0.06 2005-03-31
14  115948 2005-02-02            0.06 2005-04-30
15  115948 2005-05-16            0.04 2005-05-31
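
The `how=` and `fill_method=` arguments used above were later deprecated and removed from `resample`, so on newer pandas versions the same idea might be sketched as follows. The helper name `upsample_monthly` is made up for illustration. `pd.offsets.MonthEnd()` is passed instead of the `'M'` alias only to avoid depending on alias changes between versions, and the groups are looped over explicitly rather than with `groupby(...).apply`, whose handling of the grouping column has also changed. Treat it as a sketch under those assumptions, not a definitive rewrite.

```python
import pandas as pd

df_dg = pd.DataFrame({
    "UserID": [11590] * 3 + [115948] * 4,
    "Date": pd.to_datetime(
        ["20050613", "20050905", "20051214",
         "20040924", "20041101", "20050202", "20050516"],
        format="%Y%m%d"),
    "Characteristic": [0.06, 0.09, 0.07, 0.13, 0.09, 0.06, 0.04],
})


def upsample_monthly(group):
    """Hypothetical helper: monthly upsampling of one user's rows."""
    # Index by a copy of the observation date; the "Date" column is kept.
    g = group.set_index(group["Date"].rename("month_date"))
    # Take the first observation in each month-end bin,
    # then forward-fill the months that have no observation.
    return g.resample(pd.offsets.MonthEnd()).first().ffill()


pieces = [upsample_monthly(g) for _, g in df_dg.groupby("UserID", sort=False)]
newdf = pd.concat(pieces).reset_index()
# Empty bins turn UserID into floats before the fill, so restore integers.
newdf["UserID"] = newdf["UserID"].astype(int)
newdf = newdf[["UserID", "Date", "Characteristic", "month_date"]]
print(newdf)
```

As in the output above, each user's rows stop at the month of the last observation, whereas the expected output in the question carries the last observation forward for two further months.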