I have the following data frame:
         DATE  ID      STATUS
0  2014-01-01   1  INPROGRESS
1  2013-03-01   1       ENDED
2  2015-05-01   2  INPROGRESS
3  2012-05-01   1     STARTED
4  2011-05-01   2     STARTED
5  2011-03-01   3     STARTED
6  2011-04-01   3       ENDED
7  2011-06-01   3  INPROGRESS
8  2011-09-01   3     STARTED
Here is the code to build it:
>>> import pandas as pd
>>> df1 = pd.DataFrame(columns=["DATE", "ID", "STATUS"])
>>> df1["DATE"] = ['2014-01-01', '2013-03-01', '2015-05-01', '2012-05-01', '2011-05-01', '2011-03-01', '2011-04-01', '2011-06-01', '2011-09-01']
>>> df1["ID"] = [1,1,2,1,2,3,3,3,3]
>>> df1["STATUS"] = ['INPROGRESS', 'ENDED', 'INPROGRESS', 'STARTED', 'STARTED', 'STARTED','ENDED', 'INPROGRESS', 'STARTED']
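Within each ID group, the STATUS column describes a task whose state can be one of:
STARTED, INPROGRESS or ENDED
in exactly this chronological order (for example, there should be no STARTED after ENDED).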
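The ordering rule above can be checked per ID before attempting any repair. The following is a minimal sketch, assuming a hypothetical `rank` dictionary that encodes the stated order as integers; it only reports which IDs already follow the order and does not fix anything.

```python
import pandas as pd

# Same data as in the question.
df1 = pd.DataFrame({
    "DATE": ["2014-01-01", "2013-03-01", "2015-05-01", "2012-05-01", "2011-05-01",
             "2011-03-01", "2011-04-01", "2011-06-01", "2011-09-01"],
    "ID": [1, 1, 2, 1, 2, 3, 3, 3, 3],
    "STATUS": ["INPROGRESS", "ENDED", "INPROGRESS", "STARTED", "STARTED",
               "STARTED", "ENDED", "INPROGRESS", "STARTED"],
})

# Hypothetical rank for each status, following the stated order.
rank = {"STARTED": 0, "INPROGRESS": 1, "ENDED": 2}

# Sort by date within each ID, then test whether the ranks never decrease.
ordered = df1.sort_values(["ID", "DATE"])
in_order = ordered["STATUS"].map(rank).groupby(ordered["ID"]).apply(
    lambda s: s.is_monotonic_increasing
)
print(in_order)  # ID 2 is in order; IDs 1 and 3 are not
```

The DATE values are ISO-formatted strings, so sorting them as text gives the chronological order here; converting with `pd.to_datetime` first would be the safer choice for other date formats.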
Grouping by ID and sorting by date, I get this for ID 3:
df1[df1['ID'] == 3].sort_values('DATE')
         DATE  ID      STATUS
5  2011-03-01   3     STARTED
6  2011-04-01   3       ENDED
7  2011-06-01   3  INPROGRESS
8  2011-09-01   3     STARTED
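Now I need to "fix" the STATUS column based on the last status, so that it follows the order defined above. For ID 3 the last status is STARTED, so everything before it should be back-filled to STARTED, like this:
         DATE  ID   STATUS
5  2011-03-01   3  STARTED
6  2011-04-01   3  STARTED
7  2011-06-01   3  STARTED
8  2011-09-01   3  STARTED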
For ID 1:
df1[df1['ID'] == 1].sort_values('DATE')
         DATE  ID      STATUS
3  2012-05-01   1     STARTED
1  2013-03-01   1       ENDED
0  2014-01-01   1  INPROGRESS
I would want to end up with the last two statuses set to INPROGRESS and the first one left as STARTED:
df1[df1['ID'] == 1].sort_values('DATE')
         DATE  ID      STATUS
3  2012-05-01   1     STARTED
1  2013-03-01   1  INPROGRESS
0  2014-01-01   1  INPROGRESS
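ID 2 is already in the correct order.
Any idea how I could do this with pandas? I am trying to group by ID and was thinking of a backfill based on the last status, but I don't know how to stop the backfill at the right point.
Thanks!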
Answer 0: (score: 2)
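A classic approach is to forget that your statuses are labels and instead treat them as strictly increasing numbers, such as 0 for STARTED, 1 for INPROGRESS and 2 for ENDED. With such a column you can check, per group, whether these numbers are monotonic, and then backfill until you see a break in monotonicity.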
Prepare your data frame:
df = df1.copy()  # start from the frame built in the question
keymapping = {'STARTED':0, 'INPROGRESS':1, 'ENDED':2}
df['STATUS_ID'] = df.STATUS.map(keymapping)
df.set_index(['ID', 'DATE'], inplace=True)
df.sort_index(inplace=True)
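Now group by ID and use transform to get the last value of each group, broadcast across the whole index, so that you can assign it to the data frame as a new column: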
df['STATUS_LAST'] = df.groupby(level=0, as_index=False).STATUS_ID.transform('last')
df
Out[63]:
                   STATUS  STATUS_ID  STATUS_LAST
ID DATE
1  2012-05-01     STARTED          0            1
   2013-03-01       ENDED          2            1
   2014-01-01  INPROGRESS          1            1
2  2011-05-01     STARTED          0            1
   2015-05-01  INPROGRESS          1            1
3  2011-03-01     STARTED          0            0
   2011-04-01       ENDED          2            0
   2011-06-01  INPROGRESS          1            0
   2011-09-01     STARTED          0            0
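To make the role of transform in the step above concrete, here is a small sketch on toy data (the series `s` and the labels in `groups` are made up for illustration). It contrasts an aggregation, which returns one value per group, with transform, which returns one value per original row and can therefore be assigned straight back as a column.

```python
import pandas as pd

# Toy values and hypothetical group labels, aligned on the same index.
s = pd.Series([0, 2, 1, 0, 2], index=["a", "b", "c", "d", "e"])
groups = pd.Series(["x", "x", "x", "y", "y"], index=s.index)

per_group = s.groupby(groups).agg("last")       # one value per group
per_row = s.groupby(groups).transform("last")   # one value per original row

print(per_group.to_dict())  # {'x': 1, 'y': 2}
print(per_row.tolist())     # [1, 1, 1, 2, 2]
```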
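Finally, apply the backfill by comparing STATUS_ID against the last value: each value of STATUS_ID is valid when it is less than or equal to STATUS_LAST, and is otherwise replaced by STATUS_LAST: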
df.STATUS_ID = df.STATUS_ID.where(df.STATUS_ID <= df.STATUS_LAST, df.STATUS_LAST)
df.STATUS_ID
Out[65]:
ID  DATE
1   2012-05-01    0
    2013-03-01    1
    2014-01-01    1
2   2011-05-01    0
    2015-05-01    1
3   2011-03-01    0
    2011-04-01    0
    2011-06-01    0
    2011-09-01    0
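The where call above keeps a value when the condition holds and otherwise takes the replacement, which here amounts to capping every STATUS_ID at its group's last value. A minimal sketch on standalone series (the names `status_id` and `status_last` are stand-ins for the two columns) shows that this matches an element-wise minimum:

```python
import numpy as np
import pandas as pd

# Stand-ins for the STATUS_ID and STATUS_LAST columns of two groups.
status_id = pd.Series([0, 2, 1, 0, 2, 1, 0])
status_last = pd.Series([1, 1, 1, 0, 0, 0, 0])

# Keep values where the condition holds, otherwise use the last value.
capped = status_id.where(status_id <= status_last, status_last)

# The same result written as an element-wise minimum.
same = np.minimum(status_id, status_last)

print(capped.tolist())  # [0, 1, 1, 0, 0, 0, 0]
```

Because this is a cap rather than a true backfill, values that are already at or below the last status are left as they are, wherever they appear in the sequence.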
Map it back to the labels; assign the result to df['STATUS'] to overwrite the original column:
df.STATUS_ID.map({v:k for k,v in keymapping.items()})
Out[66]:
ID  DATE
1   2012-05-01       STARTED
    2013-03-01    INPROGRESS
    2014-01-01    INPROGRESS
2   2011-05-01       STARTED
    2015-05-01    INPROGRESS
3   2011-03-01       STARTED
    2011-04-01       STARTED
    2011-06-01       STARTED
    2011-09-01       STARTED
Name: STATUS_ID, dtype: object
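The steps of this answer can be gathered into one function that leaves the original frame and its index untouched. The sketch below is one possible arrangement, not the answerer's code: the names `RANK`, `LABEL`, `fix_status` and `fix_status_strict` are hypothetical. The second function is a swapped-in variant, a reverse cumulative minimum per ID, which is offered on the assumption that the goal is a sequence that never decreases; the capping approach alone does not guarantee that for every input.

```python
import pandas as pd

RANK = {"STARTED": 0, "INPROGRESS": 1, "ENDED": 2}   # same mapping as the answer
LABEL = {v: k for k, v in RANK.items()}

def fix_status(frame):
    """Cap each status at the last status of its ID (the answer's approach)."""
    out = frame.sort_values(["ID", "DATE"]).copy()
    rank = out["STATUS"].map(RANK)
    last = rank.groupby(out["ID"]).transform("last")
    out["STATUS"] = rank.where(rank <= last, last).map(LABEL)
    return out

def fix_status_strict(frame):
    """Variant: walk backwards in time and never let the rank rise."""
    out = frame.sort_values(["ID", "DATE"]).copy()
    rank = out["STATUS"].map(RANK)
    fixed = rank[::-1].groupby(out["ID"][::-1]).cummin()[::-1]
    out["STATUS"] = fixed.map(LABEL)
    return out

# Data from the question.
df1 = pd.DataFrame({
    "DATE": ["2014-01-01", "2013-03-01", "2015-05-01", "2012-05-01", "2011-05-01",
             "2011-03-01", "2011-04-01", "2011-06-01", "2011-09-01"],
    "ID": [1, 1, 2, 1, 2, 3, 3, 3, 3],
    "STATUS": ["INPROGRESS", "ENDED", "INPROGRESS", "STARTED", "STARTED",
               "STARTED", "ENDED", "INPROGRESS", "STARTED"],
})
print(fix_status(df1))

# A made-up case where the two functions differ.
odd = pd.DataFrame({
    "DATE": ["2020-01-01", "2020-02-01", "2020-03-01"],
    "ID": [9, 9, 9],
    "STATUS": ["INPROGRESS", "STARTED", "ENDED"],
})
print(fix_status(odd)["STATUS"].tolist())         # ['INPROGRESS', 'STARTED', 'ENDED']
print(fix_status_strict(odd)["STATUS"].tolist())  # ['STARTED', 'STARTED', 'ENDED']
```

On the question's data both functions give the same result. They differ only when an out-of-order value sits below the last status, as in the made-up `odd` frame.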