Question

我有一个很大的数据集，其中包含每个property_id的费用，该费用是根据以下数据集得出的每月发生的费用。

property_id period      amount 
1           2016-07-01  105908.20
1           2016-08-01  0.00
2           2016-08-01  114759.40
3           2014-05-01  -934.00
3           2014-06-01  -845.95
3           2017-12-01  92175.77
4           2015-09-01  -1859.75
4           2015-12-01  1859.75
4           2017-12-01  130105.00
5           2014-07-01  -6929.58

我想创建一个按property_id分组的累积总和，并从该property_id的第一个月到最近的整个月，每个月结转。

我尝试了以下操作，其中我正在使用property_id重采样并尝试转发填充，但这会导致错误

cost = cost.groupby['property_id'].apply(lambda x: x.set_index('period').resample('M').fillna(method='pad'))

TypeError：“方法”对象不可下标

以下示例输出：

> property_id   period      amount 
>     1         2016-07-01  105908.20
>     1         2016-08-01  105908.20
>     1         2016-09-01  105908.20
>     1         2016-10-01  105908.20
>     ...           
>     1         2019-07-01  105908.20
>     2         2016-08-01  114759.40
>     2         2016-09-01  114759.40
>     2         2016-10-01  114759.40
>     ...
>     2         2019-07-01  114759.40
>     3         2014-05-01  -934.00
>     3         2014-06-01  -1779.95
>     3         2014-07-01  -1779.95
>     3         2014-08-01  -1779.95
>     ...
>     3         2017-12-01  90395.82
>     3         2018-01-01  90395.82
>     3         2018-02-01  90395.82
>     3         2018-03-01  90395.82
>     ...
>     3         2019-07-01  90395.82
>     4         2015-09-01  -1859.75
>     4         2015-10-01  -1859.75
>     4         2015-11-01  -1859.75
>     4         2015-12-01  0
>     4         2016-01-01  0
>     ...
>     4         2017-11-01  0
>     4         2017-12-01  130105.00
>     4         2018-01-01  130105.00
>     ...
>     4         2019-07-01  130105.00
>     5         2014-07-01  -6929.58
>     5         2014-08-01  -6929.58
>     5         2014-09-01  -6929.58
>     ...
>     5         2019-07-01  -6929.58

任何帮助都会很棒。

谢谢！

Answer 1

首先创建DatetimeIndex，然后将groupby与resample一起使用：

df['period'] = pd.to_datetime(df['period'])

df1 = df.set_index('period').groupby('property_id').resample('M').pad()
#alternative
#df1 = df.set_index('period').groupby('property_id').resample('M').ffill()
print (df1)

                        property_id     amount
property_id period                            
1           2016-07-31            1  105908.20
            2016-08-31            1       0.00
2           2016-08-31            2  114759.40
3           2014-05-31            3    -934.00
            2014-06-30            3    -845.95
                            ...        ...
4           2017-09-30            4    1859.75
            2017-10-31            4    1859.75
            2017-11-30            4    1859.75
            2017-12-31            4  130105.00
5           2014-07-31            5   -6929.58

[76 rows x 2 columns]

编辑：想法是通过按property_id的最后一个值进行过滤来创建新的DataFrame，然后按条件分配月份，然后附加到原始内容并使用上面的解决方案：

df['period'] = pd.to_datetime(df['period'])
df = df.sort_values(['property_id','period'])

last = pd.to_datetime('now').floor('d') 
nextday = (last + pd.Timedelta(1, 'd')).day
orig_month = last.to_period('m').to_timestamp()
before_month = (last.to_period('m') - 1).to_timestamp()

last = orig_month if nextday == 1 else before_month
print (last)
2019-07-01 00:00:00

df1 = df.drop_duplicates('property_id', keep='last').assign(period=last)
print (df1)
   property_id     period     amount
1            1 2019-07-01       0.00
2            2 2019-07-01  114759.40
5            3 2019-07-01   92175.77
8            4 2019-07-01  130105.00
9            5 2019-07-01   -6929.58

df = pd.concat([df, df1])
df1 = (df.set_index('period')
        .groupby('property_id')['amount']
        .resample('MS')
        .asfreq(fill_value=0)
        .groupby(level=0)
        .cumsum())
print (df1)
property_id  period    
1            2016-07-01    105908.20
             2016-08-01    105908.20
             2016-09-01    105908.20
             2016-10-01    105908.20
             2016-11-01    105908.20

5            2019-03-01   -394986.06
             2019-04-01   -401915.64
             2019-05-01   -408845.22
             2019-06-01   -415774.80
             2019-07-01   -422704.38
Name: amount, Length: 244, dtype: float64

Python：重新采样和前填充到最近一个月

1 个答案: