Sample DataFrame:
Date | ID | Type 1 | Type 2 | Type 3
-----------------------------------------
2017-06-05 | 1 | 2 | 1 | 0
2017-08-05 | 1 | 0 | 1 | 0
2017-10-05 | 1 | 2 | 1 | 1
2017-06-05 | 2 | 0 | 1 | 0
2017-07-05 | 2 | 2 | 0 | 0
2017-09-15 | 3 | 0 | 0 | 5
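To make the sample reproducible, it could be built roughly as follows. This is only a sketch: it assumes the column names are exactly those shown in the table and that Date should be parsed as a datetime column.

```python
import pandas as pd

# sample data copied from the table above
df = pd.DataFrame({
    "Date": ["2017-06-05", "2017-08-05", "2017-10-05",
             "2017-06-05", "2017-07-05", "2017-09-15"],
    "ID": [1, 1, 1, 2, 2, 3],
    "Type 1": [2, 0, 2, 0, 2, 0],
    "Type 2": [1, 1, 1, 1, 0, 0],
    "Type 3": [0, 0, 1, 0, 0, 5],
})

# parse the ISO-formatted strings into datetimes
df["Date"] = pd.to_datetime(df["Date"], format="%Y-%m-%d")
```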
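I want to group by month so that each ID has one row for every month, up to the last month for which it has data. For example, ID = 1 has data from month 6 through month 10, so ID = 1 should get a row for each month from 6 to 10.
Expected output for ID = 1: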
Date | ID | Type 1 | Type 2 | Type 3
-----------------------------------------
2017-06-05 | 1 | 2 | 1 | 0
2017-07-05 | 1 | 2 | 1 | 0
2017-08-05 | 1 | 0 | 1 | 0
2017-09-05 | 1 | 0 | 1 | 0
2017-10-05 | 1 | 2 | 1 | 1
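Note that the Type columns are not summed; the missing months are filled with the most recent earlier data. For example, month 7 reuses the data from month 6.
The following scenario is outside the scope of this question:
What if the input DataFrame has multiple rows within the same month?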
Date | ID | Type 1 | Type 2 | Type 3
-----------------------------------------
2017-06-05 | 1 | 2 | 1 | 0
2017-06-19 | 1 | 0 | 1 | 0
2017-10-05 | 1 | 2 | 1 | 1
2017-06-05 | 2 | 0 | 1 | 0
2017-06-25 | 2 | 2 | 0 | 0
2017-09-15 | 3 | 0 | 0 | 5
How should the data be aggregated in that case, so that each ID has only one row per month?
Answer 0 (score: 1)
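The main problem is adding the days back, because resampling with MS snaps each date to the start of its month:
import pandas as pd
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
# replace the day of each date with 1 (first day of the month)
t1 = df['Date'].dt.to_period('M').dt.to_timestamp()
# difference between each date and the first day of its month
a = df['Date'] - t1
# MultiIndex Series holding that difference for each (ID, month start)
s = pd.Series(a.values, index=[df['ID'], t1])
print (s)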
ID  Date
1   2017-06-01    4 days
    2017-08-01    4 days
    2017-10-01    4 days
2   2017-06-01    4 days
    2017-07-01    4 days
3   2017-09-01   14 days
dtype: timedelta64[ns]
# helper df2, indexed by ID and the original Date, used below to fill the NaN rows
df2 = df.set_index(['ID','Date'])
# add the missing months by resampling to month start and forward filling
df1 = df.set_index(['Date']).groupby('ID').resample('MS').ffill()
print (df1)
                ID  Type 1  Type 2  Type 3
ID Date
1  2017-06-01  NaN     NaN     NaN     NaN
   2017-07-01  1.0     2.0     1.0     0.0
   2017-08-01  1.0     2.0     1.0     0.0
   2017-09-01  1.0     0.0     1.0     0.0
   2017-10-01  1.0     0.0     1.0     0.0
2  2017-06-01  NaN     NaN     NaN     NaN
   2017-07-01  2.0     0.0     1.0     0.0
3  2017-09-01  NaN     NaN     NaN     NaN
# extend the timedeltas to the rows added in df1 by forward filling
s1 = s.reindex(df1.index, method='ffill')
print (s1)
ID  Date
1   2017-06-01    4 days
    2017-07-01    4 days
    2017-08-01    4 days
    2017-09-01    4 days
    2017-10-01    4 days
2   2017-06-01    4 days
    2017-07-01    4 days
3   2017-09-01   14 days
dtype: timedelta64[ns]
# build the final MultiIndex by adding the timedeltas back to the month starts
mux = [df1.index.get_level_values('ID'),
       df1.index.get_level_values('Date') + s1.values]
# fill the remaining NaN rows by combining with the original data
df = df1.drop(columns='ID').set_index(mux).combine_first(df2).reset_index()
print (df)
   ID       Date  Type 1  Type 2  Type 3
0   1 2017-06-05     2.0     1.0     0.0
1   1 2017-07-05     2.0     1.0     0.0
2   1 2017-08-05     2.0     1.0     0.0
3   1 2017-09-05     0.0     1.0     0.0
4   1 2017-10-05     0.0     1.0     0.0
5   2 2017-06-05     0.0     1.0     0.0
6   2 2017-07-05     0.0     1.0     0.0
7   3 2017-09-15     0.0     0.0     5.0
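In the output above, the rows for 2017-08-05 and 2017-10-05 of ID 1 carry forward-filled values rather than the values recorded for those months. If each existing month should keep its own data, as in the expected output in the question, one option is to reindex each ID on monthly periods and forward fill only the gaps. The following is a minimal sketch of that idea, not the method used above. `fill_missing_months`, `sample` and `result` are hypothetical names, and the sketch assumes at most one row per ID per month.

```python
import pandas as pd

def fill_missing_months(df):
    """Hypothetical helper: one row per ID per month, gaps forward-filled."""
    parts = []
    for _, g in df.sort_values(["ID", "Date"]).groupby("ID"):
        dates = pd.DatetimeIndex(g["Date"])
        months = dates.to_period("M")
        g = g.drop(columns="Date")
        # remember how far each date is from the first day of its month
        g["offset"] = (dates - months.to_timestamp()).to_numpy()
        g.index = months
        # every month between the first and last observation of this ID
        full = pd.period_range(months.min(), months.max(), freq="M")
        g = g.reindex(full).ffill()
        # rebuild the dates: month start plus the forward-filled offset
        new_dates = g.index.to_timestamp() + pd.TimedeltaIndex(g.pop("offset"))
        g.insert(0, "Date", new_dates)
        parts.append(g)
    out = pd.concat(parts, ignore_index=True)
    # reindexing introduces NaN and therefore floats; restore the input dtypes
    return out.astype(df.dtypes.to_dict())

# sample data from the question
sample = pd.DataFrame({
    "Date": pd.to_datetime(["2017-06-05", "2017-08-05", "2017-10-05",
                            "2017-06-05", "2017-07-05", "2017-09-15"]),
    "ID": [1, 1, 1, 2, 2, 3],
    "Type 1": [2, 0, 2, 0, 2, 0],
    "Type 2": [1, 1, 1, 1, 0, 0],
    "Type 3": [0, 0, 1, 0, 0, 5],
})
result = fill_missing_months(sample)
print(result)
```

Because the day offset is forward filled together with the data, a filled row inherits the day of the month of the last real observation. An offset taken from a late day in a long month can spill into the following month when applied to a shorter one, so this may need adjusting for such data.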
Edit (for input with multiple rows in the same month):
# set the day of each date to 1
df['Date'] = df['Date'].dt.to_period('M').dt.to_timestamp()
# aggregate so that each (month, ID) pair is unique
df1 = df.groupby(['Date','ID']).sum()
print (df1)
               Type 1  Type 2  Type 3
Date       ID
2017-06-01 1        2       2       0
           2        2       1       0
2017-09-01 3        0       0       5
2017-10-01 1        2       1       1
# add the missing months by resampling to month start and forward filling
df1 = df1.reset_index(['ID']).groupby('ID').resample('MS').ffill()
print (df1)
               ID  Type 1  Type 2  Type 3
ID Date
1  2017-06-01   1       2       2       0
   2017-07-01   1       2       2       0
   2017-08-01   1       2       2       0
   2017-09-01   1       2       2       0
   2017-10-01   1       2       1       1
2  2017-06-01   2       2       1       0
3  2017-09-01   3       0       0       5
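The aggregation steps can also be written end to end without groupby().resample(), by building the full (ID, month) index explicitly. This is a rough sketch under the same assumption as above, namely that rows in the same month are summed. The names `raw`, `monthly`, `spans`, `full_index` and `filled` are illustrative only.

```python
import pandas as pd

# second sample from the question: several rows per ID within one month
raw = pd.DataFrame({
    "Date": pd.to_datetime(["2017-06-05", "2017-06-19", "2017-10-05",
                            "2017-06-05", "2017-06-25", "2017-09-15"]),
    "ID": [1, 1, 1, 2, 2, 3],
    "Type 1": [2, 0, 2, 0, 2, 0],
    "Type 2": [1, 1, 1, 1, 0, 0],
    "Type 3": [0, 0, 1, 0, 0, 5],
})

# snap every date to the first day of its month, then sum within (ID, month)
month_start = raw["Date"].dt.to_period("M").dt.to_timestamp()
monthly = raw.assign(Date=month_start).groupby(["ID", "Date"]).sum()

# list every month between the first and last observation of each ID
spans = monthly.reset_index().groupby("ID")["Date"].agg(["min", "max"])
full_index = pd.MultiIndex.from_tuples(
    [(id_, d) for id_, lo, hi in spans.itertuples()
     for d in pd.date_range(lo, hi, freq="MS")],
    names=["ID", "Date"],
)

# add the missing months and forward fill within each ID
filled = monthly.reindex(full_index).groupby(level="ID").ffill().astype(int)
print(filled)
```

Building the index by hand keeps the result independent of how a given pandas version treats the grouping column in groupby().resample().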