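For a DataFrame like this (columns city, district, id, date, price), how can I group by id and fill in the missing months, while leaving price as NaN for the months that were added? The expected date range is 2015/1/1 to 2019/8/1.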
Answer 0 (score: 1)
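Edit:
In the real data, each combination of the city, district, id and date columns has to be unique before reindexing, so aggregate the duplicates first: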
df = df.groupby(['city','district','id', 'date'], as_index=False)['price'].sum()
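To illustrate why the aggregation step matters, here is a minimal sketch. The sample rows and the shortened date range are made up for the example and are not the asker's data. With duplicate dates in the index, reindex raises a ValueError; after summing the duplicates it works.

```python
import pandas as pd

# Hypothetical sample: id 20101 has two rows for 2015-03-01.
df = pd.DataFrame({
    'city': ['hz', 'hz', 'hz'],
    'district': ['sn', 'sn', 'sn'],
    'id': [20101, 20101, 20101],
    'date': ['2015-03-01', '2015-03-01', '2015-05-01'],
    'price': [1.0, 1.5, 2.0],
})
df['date'] = pd.to_datetime(df['date'])

# Shortened range, only to keep the example small.
rng = pd.date_range('2015-01-01', '2015-06-01', freq='MS')

# Reindexing an index that contains duplicate dates raises ValueError.
try:
    df.set_index('date').reindex(rng)
    duplicate_error = None
except ValueError as exc:
    duplicate_error = str(exc)

# After aggregation there is one row per (city, district, id, date).
agg = df.groupby(['city', 'district', 'id', 'date'], as_index=False)['price'].sum()
filled = agg.set_index('date').reindex(rng)
```

Here sum() is the aggregation used in the answer; whether sum, mean or something else is appropriate depends on what duplicate rows mean in the real data.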
If you need to group by the id column:
rng = pd.date_range('2015-01-01','2019-08-01', freq='MS')
df['date'] = pd.to_datetime(df['date'])
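df1 = (df.set_index('date')
         .groupby('id')
         .apply(lambda x: x.reindex(rng))        # one row per month in rng for each id
         .rename_axis(('id','date'))
         .drop(columns='id', errors='ignore')    # newer pandas may not pass the id column to apply
         .reset_index()
      )
print(df1)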
id date city district price
0 20101 2015-01-01 NaN NaN NaN
1 20101 2015-02-01 NaN NaN NaN
2 20101 2015-03-01 NaN NaN NaN
3 20101 2015-04-01 NaN NaN NaN
4 20101 2015-05-01 NaN NaN NaN
.. ... ... ... ... ...
163 20103 2019-04-01 NaN NaN NaN
164 20103 2019-05-01 NaN NaN NaN
165 20103 2019-06-01 NaN NaN NaN
166 20103 2019-07-01 NaN NaN NaN
167 20103 2019-08-01 NaN NaN NaN
[168 rows x 5 columns]
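As an alternative sketch (not the answerer's method), the same result can be obtained without groupby.apply by reindexing against the full product of ids and months, built with pd.MultiIndex.from_product. The sample rows below are hypothetical, and the sketch assumes each (id, date) pair is already unique.

```python
import pandas as pd

# Hypothetical sample data, not the asker's actual frame.
df = pd.DataFrame({
    'city': ['hz', 'hz', 'xz'],
    'district': ['sn', 'sn', 'pd'],
    'id': [20101, 20101, 20103],
    'date': ['2015-03-01', '2015-05-01', '2016-01-01'],
    'price': [2.2, 2.4, 3.1],
})
df['date'] = pd.to_datetime(df['date'])

rng = pd.date_range('2015-01-01', '2019-08-01', freq='MS')

# Every (id, month) combination that should exist in the result.
full_index = pd.MultiIndex.from_product(
    [df['id'].unique(), rng], names=['id', 'date'])

# Rows that were absent from df come back with NaN in every column.
df1 = (df.set_index(['id', 'date'])
         .reindex(full_index)
         .reset_index())
```

The range covers 56 month starts, so each id ends up with 56 rows. As in the output above, city and district are NaN on the added rows, and price stays NaN as the question asks.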
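If you need to group by more columns (note that fill_value=0 fills the added months with 0; omit it to keep NaN):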
rng = pd.date_range('2015-01-01','2019-08-01', freq='MS')
df['date'] = pd.to_datetime(df['date'])
df2 = (df.set_index('date')
         .groupby(['city','district','id'])['price']
         .apply(lambda x: x.reindex(rng, fill_value=0))
         .rename_axis(('city','district','id','date'))
         .reset_index()
      )
print(df2)
city district id date price
0 hz sn 20101 2015-01-01 0.0
1 hz sn 20101 2015-02-01 0.0
2 hz sn 20101 2015-03-01 0.0
3 hz sn 20101 2015-04-01 0.0
4 hz sn 20101 2015-05-01 0.0
.. ... ... ... ... ...
219 xz pd 20103 2019-04-01 0.0
220 xz pd 20103 2019-05-01 0.0
221 xz pd 20103 2019-06-01 0.0
222 xz pd 20103 2019-07-01 0.0
223 xz pd 20103 2019-08-01 0.0
[224 rows x 5 columns]
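If the added months should keep NaN in price while city, district and id stay populated, one possible sketch is to build the full grid with a cross join and left-merge the original rows onto it. This swaps in merge(how='cross') (available in pandas 1.2 and later) in place of the reindex approach above; the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical sample data, not the asker's actual frame.
df = pd.DataFrame({
    'city': ['hz', 'hz', 'xz'],
    'district': ['sn', 'sn', 'pd'],
    'id': [20101, 20101, 20103],
    'date': ['2015-03-01', '2015-05-01', '2016-01-01'],
    'price': [2.2, 2.4, 3.1],
})
df['date'] = pd.to_datetime(df['date'])

rng = pd.date_range('2015-01-01', '2019-08-01', freq='MS')

# One row per distinct (city, district, id) group.
keys = df[['city', 'district', 'id']].drop_duplicates()

# Cross join: every group paired with every month in the range.
grid = keys.merge(pd.DataFrame({'date': rng}), how='cross')

# Left merge brings in price where it exists and leaves NaN elsewhere.
df2 = grid.merge(df, on=['city', 'district', 'id', 'date'], how='left')
```

Because the key columns come from the grid rather than from reindexing, id keeps its integer dtype and no forward or backward filling is needed.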
Answer 1 (score: 1)
Use reindex with a date range of frequency MS (month start), and combine pd.concat with GroupBy:
dates = pd.date_range('2015-01-01','2019-08-01', freq='MS')
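# assumes df['date'] has already been converted with pd.to_datetime
new = pd.concat([
    d.set_index('date').reindex(dates).reset_index().rename(columns={'index':'date'})
    for _, d in df.groupby('id')
], ignore_index=True)
# ffill/bfill run over the whole frame, so price is filled too
# and values can carry over from one id to the next
new = new.ffill().bfill()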
Output
date city district id price
0 2015-01-01 hz sn 20101.0 2.2
1 2015-02-01 hz sn 20101.0 2.2
2 2015-03-01 hz sn 20101.0 2.2
3 2015-04-01 hz sn 20101.0 2.2
4 2015-05-01 hz sn 20101.0 2.2
.. ... ... ... ... ...
163 2019-04-01 xz pd 20103.0 3.1
164 2019-05-01 xz pd 20103.0 3.1
165 2019-06-01 xz pd 20103.0 3.1
166 2019-07-01 xz pd 20103.0 3.1
167 2019-08-01 xz pd 20103.0 3.1
[168 rows x 5 columns]
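The output above has price filled in and id turned into a float. If price should stay NaN, as the question asks, a possible variation is to fill only the descriptive columns inside each group. This is a sketch on hypothetical sample rows, and it assumes city and district are constant for a given id.

```python
import pandas as pd

# Hypothetical sample data, not the asker's actual frame.
df = pd.DataFrame({
    'city': ['hz', 'hz', 'xz'],
    'district': ['sn', 'sn', 'pd'],
    'id': [20101, 20101, 20103],
    'date': ['2015-03-01', '2015-05-01', '2016-01-01'],
    'price': [2.2, 2.4, 3.1],
})
df['date'] = pd.to_datetime(df['date'])

dates = pd.date_range('2015-01-01', '2019-08-01', freq='MS')

parts = []
for key, d in df.groupby('id'):
    part = (d.set_index('date')
              .reindex(dates)
              .rename_axis('date')
              .reset_index())
    # Restore the integer id instead of forward-filling a float column.
    part['id'] = key
    # Fill only the descriptive columns, and only within this group.
    part[['city', 'district']] = part[['city', 'district']].ffill().bfill()
    parts.append(part)

new = pd.concat(parts, ignore_index=True)
```

Filling inside the loop keeps values from one id from leaking into the next, and leaving price out of the fill keeps the added months as NaN.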