I have a large dataset of costs per property_id, with one row for each month in which a cost was incurred. A sample:
property_id period amount
1 2016-07-01 105908.20
1 2016-08-01 0.00
2 2016-08-01 114759.40
3 2014-05-01 -934.00
3 2014-06-01 -845.95
3 2017-12-01 92175.77
4 2015-09-01 -1859.75
4 2015-12-01 1859.75
4 2017-12-01 130105.00
5 2014-07-01 -6929.58
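I want to create a cumulative sum grouped by property_id and carry it forward for every month, from that property_id's first month through the most recent full month.
I tried the following, resampling per property_id and forward filling, but it raises an error: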
cost = cost.groupby['property_id'].apply(lambda x: x.set_index('period').resample('M').fillna(method='pad'))
TypeError: 'method' object is not subscriptable
Desired output:
> property_id period amount
> 1 2016-07-01 105908.20
> 1 2016-08-01 105908.20
> 1 2016-09-01 105908.20
> 1 2016-10-01 105908.20
> ...
> 1 2019-07-01 105908.20
> 2 2016-08-01 114759.40
> 2 2016-09-01 114759.40
> 2 2016-10-01 114759.40
> ...
> 2 2019-07-01 114759.40
> 3 2014-05-01 -934.00
> 3 2014-06-01 -1779.95
> 3 2014-07-01 -1779.95
> 3 2014-08-01 -1779.95
> ...
> 3 2017-12-01 90395.82
> 3 2018-01-01 90395.82
> 3 2018-02-01 90395.82
> 3 2018-03-01 90395.82
> ...
> 3 2019-07-01 90395.82
> 4 2015-09-01 -1859.75
> 4 2015-10-01 -1859.75
> 4 2015-11-01 -1859.75
> 4 2015-12-01 0
> 4 2016-01-01 0
> ...
> 4 2017-11-01 0
> 4 2017-12-01 130105.00
> 4 2018-01-01 130105.00
> ...
> 4 2019-07-01 130105.00
> 5 2014-07-01 -6929.58
> 5 2014-08-01 -6929.58
> 5 2014-09-01 -6929.58
> ...
> 5 2019-07-01 -6929.58
Any help would be great.
Thanks!
Answer 0 (score: 1)
First create a DatetimeIndex, then use groupby together with resample:
df['period'] = pd.to_datetime(df['period'])
df1 = df.set_index('period').groupby('property_id').resample('M').ffill()
#alternative for older pandas: pad() is an alias of ffill() that later pandas versions deprecated
#df1 = df.set_index('period').groupby('property_id').resample('M').pad()
print (df1)
property_id amount
property_id period
1 2016-07-31 1 105908.20
2016-08-31 1 0.00
2 2016-08-31 2 114759.40
3 2014-05-31 3 -934.00
2014-06-30 3 -845.95
... ...
4 2017-09-30 4 1859.75
2017-10-31 4 1859.75
2017-11-30 4 1859.75
2017-12-31 4 130105.00
5 2014-07-31 5 -6929.58
[76 rows x 2 columns]
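As for the TypeError in the question: `groupby` is a method, so it has to be called with parentheses, `cost.groupby('property_id')`. Writing `cost.groupby['property_id']` tries to index the method object itself. A minimal sketch of the difference, using a hypothetical three-row `cost` frame that is not the asker's real data:

```python
import pandas as pd

# Hypothetical miniature of the question's data, for illustration only.
cost = pd.DataFrame({
    "property_id": [1, 1, 2],
    "period": pd.to_datetime(["2016-07-01", "2016-08-01", "2016-08-01"]),
    "amount": [105908.20, 0.00, 114759.40],
})

# Square brackets subscript the bound method instead of calling it.
try:
    cost.groupby["property_id"]
    message = ""
except TypeError as exc:
    message = str(exc)

# Parentheses call the method and return a GroupBy object.
grouped = cost.groupby("property_id")

print(message)
print(grouped.ngroups)
```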
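EDIT: The idea is to create a new DataFrame that keeps only the last row of each property_id, set its period to the target month (chosen by a condition), append it to the original data, and then resample as in the solution above: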
df['period'] = pd.to_datetime(df['period'])
df = df.sort_values(['property_id','period'])
last = pd.to_datetime('now').floor('d')
nextday = (last + pd.Timedelta(1, 'd')).day
orig_month = last.to_period('m').to_timestamp()
before_month = (last.to_period('m') - 1).to_timestamp()
last = orig_month if nextday == 1 else before_month
print (last)
2019-07-01 00:00:00
#amount=0.0 so that the appended end-of-range rows do not add to the cumulative sum
df1 = df.drop_duplicates('property_id', keep='last').assign(period=last, amount=0.0)
print (df1)
property_id period amount
1 1 2019-07-01 0.0
2 2 2019-07-01 0.0
5 3 2019-07-01 0.0
8 4 2019-07-01 0.0
9 5 2019-07-01 0.0
df = pd.concat([df, df1])
df1 = (df.set_index('period')
.groupby('property_id')['amount']
.resample('MS')
.asfreq(fill_value=0)
.groupby(level=0)
.cumsum())
print (df1)
property_id period
1 2016-07-01 105908.20
2016-08-01 105908.20
2016-09-01 105908.20
2016-10-01 105908.20
2016-11-01 105908.20
...
5 2019-03-01 -6929.58
2019-04-01 -6929.58
2019-05-01 -6929.58
2019-06-01 -6929.58
2019-07-01 -6929.58
Name: amount, Length: 244, dtype: float64
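For comparison, here is an alternative sketch that takes the cumulative sum first and then forward-fills it over a monthly range, which is one way to arrive at the desired output shown in the question. It is not the method used above. The helper name `last_full_month_start` is made up for this example, the reference date is fixed at 2019-08-15 so the result is reproducible, and the code assumes at most one row per property_id and month.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "property_id": [1, 1, 2, 3, 3, 3, 4, 4, 4, 5],
    "period": ["2016-07-01", "2016-08-01", "2016-08-01", "2014-05-01",
               "2014-06-01", "2017-12-01", "2015-09-01", "2015-12-01",
               "2017-12-01", "2014-07-01"],
    "amount": [105908.20, 0.00, 114759.40, -934.00, -845.95, 92175.77,
               -1859.75, 1859.75, 130105.00, -6929.58],
})
df["period"] = pd.to_datetime(df["period"])
df = df.sort_values(["property_id", "period"])


def last_full_month_start(today):
    """Hypothetical helper: first day of the most recent complete month."""
    today = pd.Timestamp(today).normalize()
    # On the last day of a month, that month is treated as complete.
    if (today + pd.Timedelta(days=1)).day == 1:
        return today.to_period("M").to_timestamp()
    return (today.to_period("M") - 1).to_timestamp()


# Assumption: fixed reference date instead of the current date.
last = last_full_month_start("2019-08-15")

# Running total per property, computed on the original sparse rows.
df["cum_amount"] = df.groupby("property_id")["amount"].cumsum()

frames = []
for pid, group in df.groupby("property_id"):
    totals = group.set_index("period")["cum_amount"]
    # Every month start from the property's first month up to `last`.
    months = pd.date_range(totals.index.min(), last, freq="MS")
    # Months without a row inherit the previous running total.
    filled = totals.reindex(months).ffill()
    frames.append(pd.DataFrame({
        "property_id": pid,
        "period": months,
        "amount": filled.to_numpy(),
    }))

result = pd.concat(frames, ignore_index=True)
print(result)
```

If the real data can contain several rows for the same property_id and month, they would need to be summed before the `reindex` step, since `reindex` does not accept duplicate index labels.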