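I have a data frame (300 million rows) that includes each patient's initial hospital stay and any readmissions for that patient. I need to identify the readmissions. The data frame is grouped by patient ID (Key) and sorted by Key, then by 'admit' date ascending. There are two conditions for identifying a readmission:
Condition 1 - The Num1 of the row being tested must be a value between 10 and 15.
Condition 2 - The row's admit date must fall between the previous row's discharge and 90DayPostDischarge dates. The caveat to the second rule is that a patient can have multiple initial stays. These initial stays are identified by being more than 90 days apart from the first stay. An example of this is the result for Key 10003: indexes 0 and 2 are both initial stays.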
import pandas as pd
from datetime import timedelta

df = pd.DataFrame({'Key': ['10003', '10003', '10003', '10003', '10003','10034','10034', '10034'],
'Num1': [13,13,13,13,13,13,16,13],
'Num2': [121,122,122,124,125,126,127,128],
'admit': [20120506, 20120511, 20121010,20121015,20121020,20120510,20120516,20120520],
'discharge': [20120510, 20120515, 20121012,20121016,20121023,20120515,20120518,20120522]})
# Parse the integer YYYYMMDD values into datetimes
df['admit'] = pd.to_datetime(df['admit'], format='%Y%m%d')
df['discharge'] = pd.to_datetime(df['discharge'], format='%Y%m%d')
# End of the 90-day window that follows each discharge
df['90DayPostDischarge'] = df['discharge'] + timedelta(days=90)
df
Initial df:
Key Num1 Num2 admit discharge 90DayPostDischarge
0 10003 13 121 2012-05-06 2012-05-10 2012-08-08
1 10003 13 122 2012-05-11 2012-05-15 2012-08-13
2 10003 13 122 2012-10-10 2012-10-12 2013-01-10
3 10003 13 124 2012-10-15 2012-10-16 2013-01-14
4 10003 13 125 2012-10-20 2012-10-23 2013-01-21
5 10034 13 126 2012-05-10 2012-05-15 2012-08-13
6 10034 16 127 2012-05-16 2012-05-18 2012-08-16
7 10034 13 128 2012-05-20 2012-05-22 2012-08-20
Desired result:
Key Num1 Num2 admit discharge 90DayPostDischarge Readmit
0 10003 13 121 2012-05-06 2012-05-10 2012-08-08 0 #the first row of every group will be false(0) as it cannot be compared to the previous rows
1 10003 13 122 2012-05-11 2012-05-15 2012-08-13 1 #this qualifies as a readmit to the previous row
2 10003 13 122 2012-10-10 2012-10-12 2013-01-10 0 #this is the same patient but because this row is outside of the previous date ranges, it will be considered a new initial stay
3 10003 13 124 2012-10-15 2012-10-16 2013-01-14 1 #this will be flagged as a readmit to the previous row
4 10003 13 125 2012-10-20 2012-10-23 2013-01-21 1 #this too will be a readmit FOR THE INITIAL STAY AT INDEX 2
5 10034 13 126 2012-05-10 2012-05-15 2012-08-13 0 #the first row of every group will be false(0) as it cannot be compared to the previous rows
6 10034 16 127 2012-05-16 2012-05-18 2012-08-16 0 #this row has a num1 value that is out of the range of 10-15 so it will be flagged as false(0)
7 10034 13 128 2012-05-20 2012-05-22 2012-08-20 1 #this will be flagged as true(1) because of index 5
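My attempt: I first drop all rows that do not meet the first condition (I realize that creating a new df may not be the best approach; I'm struggling with this part). Next I tried to flag the rows that might satisfy the second condition, but my code returns only the single value 'true' instead of a df with a flag column. I'm having a brain fart on this approach. Any help would be greatly appreciated.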
df2 = df[df['Num1'].isin([10,11,12,13,14,15])]
df2 = df.loc[((df['admit'] > df['discharge'].shift(1)) & \
(df['admit'] <= df['90DayPostDischarge'].shift(1))),'readmit'] = 'true'
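A note on why the attempt above yields only 'true': in Python, `a = b = value` assigns `value` to both targets, so `df2` ends up bound to the string 'true'. The mask is also built from `df` rather than the filtered `df2`, and `shift(1)` is not applied per Key. One possible direction is sketched below. It assumes that "previous row" means the previous row of the same Key whose Num1 is also in the 10-15 range (inferred from index 7 being flagged because of index 5), and that comparing against that immediately preceding stay is enough. If the 90-day window has to be anchored to the initial stay rather than the previous row, a different approach would be needed. The names `eligible`, `prev_discharge`, and `is_readmit` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'Key': ['10003', '10003', '10003', '10003', '10003', '10034', '10034', '10034'],
    'Num1': [13, 13, 13, 13, 13, 13, 16, 13],
    'Num2': [121, 122, 122, 124, 125, 126, 127, 128],
    'admit': [20120506, 20120511, 20121010, 20121015, 20121020, 20120510, 20120516, 20120520],
    'discharge': [20120510, 20120515, 20121012, 20121016, 20121023, 20120515, 20120518, 20120522],
})
df['admit'] = pd.to_datetime(df['admit'], format='%Y%m%d')
df['discharge'] = pd.to_datetime(df['discharge'], format='%Y%m%d')

# Condition 1: keep only rows whose Num1 is between 10 and 15 (inclusive).
eligible = df[df['Num1'].between(10, 15)]

# Discharge date of the previous eligible row within the same patient.
# The first row of each Key gets NaT, which compares as False below.
prev_discharge = eligible.groupby('Key')['discharge'].shift(1)

# Condition 2: admitted after the previous discharge and within 90 days of it.
is_readmit = (
    (eligible['admit'] > prev_discharge)
    & (eligible['admit'] <= prev_discharge + pd.Timedelta(days=90))
)

# Write the flag back to the full frame; rows that failed condition 1 get 0.
df['Readmit'] = is_readmit.reindex(df.index, fill_value=False).astype(int)
print(df)
```

On the sample data this should reproduce the Readmit column in the desired result. Everything here is vectorized, which matters at 300 million rows, though I have not tried it at that scale.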