If in the previous row

Time: 2018-03-05 19:28:40

Tags: python pandas datetime pandas-groupby

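I have a dataframe (300 million rows) that contains patients' initial hospital stays along with any readmissions for the same patient. I need to identify the readmissions. The dataframe is grouped by patient ID (Key) and sorted by Key, then by 'admit' date in ascending order. Two conditions determine whether a row is a readmission:

First condition: the Num1 value of the row being tested must be between 10 and 15.

Second condition: the row's admit date must fall between the previous row's discharge date and its 90DayPostDischarge date. The caveat to the second rule is that a patient may have more than one initial stay. A stay counts as a new initial stay when it is more than 90 days after the previous stay. An example of this is Key 10003 in the results: index 0 and index 2 are both initial stays.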

import pandas as pd
from datetime import timedelta

df = pd.DataFrame({'Key': ['10003', '10003', '10003', '10003', '10003', '10034', '10034', '10034'],
                   'Num1': [13, 13, 13, 13, 13, 13, 16, 13],
                   'Num2': [121, 122, 122, 124, 125, 126, 127, 128],
                   'admit': [20120506, 20120511, 20121010, 20121015, 20121020, 20120510, 20120516, 20120520],
                   'discharge': [20120510, 20120515, 20121012, 20121016, 20121023, 20120515, 20120518, 20120522]})
# Parse the integer dates (YYYYMMDD) into datetimes
df['admit'] = pd.to_datetime(df['admit'], format='%Y%m%d')
df['discharge'] = pd.to_datetime(df['discharge'], format='%Y%m%d')
# End of the 90-day readmission window that follows each discharge
df['90DayPostDischarge'] = df['discharge'] + timedelta(days=90)
df

Initial df:

    Key     Num1    Num2    admit       discharge   90DayPostDischarge
0   10003   13      121     2012-05-06  2012-05-10  2012-08-08
1   10003   13      122     2012-05-11  2012-05-15  2012-08-13
2   10003   13      122     2012-10-10  2012-10-12  2013-01-10
3   10003   13      124     2012-10-15  2012-10-16  2013-01-14
4   10003   13      125     2012-10-20  2012-10-23  2013-01-21
5   10034   13      126     2012-05-10  2012-05-15  2012-08-13
6   10034   16      127     2012-05-16  2012-05-18  2012-08-16
7   10034   13      128     2012-05-20  2012-05-22  2012-08-20

Desired result:

    Key     Num1    Num2    admit       discharge   90DayPostDischarge Readmit
0   10003   13      121     2012-05-06  2012-05-10  2012-08-08         0        #the first row of every group will be false(0) as it cannot be compared to the previous rows
1   10003   13      122     2012-05-11  2012-05-15  2012-08-13         1        #this qualifies as a readmit to the previous row
2   10003   13      122     2012-10-10  2012-10-12  2013-01-10         0        #this is the same patient but because this row is outside of the previous date ranges, it will be considered a new initial stay
3   10003   13      124     2012-10-15  2012-10-16  2013-01-14         1        #this will be flagged as a readmit to the previous row
4   10003   13      125     2012-10-20  2012-10-23  2013-01-21         1        #this too will be a readmit FOR THE INITIAL STAY AT INDEX 2
5   10034   13      126     2012-05-10  2012-05-15  2012-08-13         0        #the first row of every group will be false(0) as it cannot be compared to the previous rows
6   10034   16      127     2012-05-16  2012-05-18  2012-08-16         0        #this row has a num1 value that is out of the range of 10-15 so it will be flagged as false(0)
7   10034   13      128     2012-05-20  2012-05-22  2012-08-20         1        #this will be flagged as true(1) because of index 5
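One possible way to produce the Readmit column above is sketched here. It is not a confirmed answer, and it rests on an assumption taken from the comment on index 7: "previous row" is read as the previous row of the same Key whose Num1 is in range. The names `eligible`, `elig`, `prev` and `in_window` are made up for this illustration, and Num2 is left out because it plays no part in either condition.

```python
import pandas as pd

df = pd.DataFrame({
    'Key': ['10003', '10003', '10003', '10003', '10003', '10034', '10034', '10034'],
    'Num1': [13, 13, 13, 13, 13, 13, 16, 13],
    'admit': [20120506, 20120511, 20121010, 20121015, 20121020, 20120510, 20120516, 20120520],
    'discharge': [20120510, 20120515, 20121012, 20121016, 20121023, 20120515, 20120518, 20120522],
})
for col in ('admit', 'discharge'):
    df[col] = pd.to_datetime(df[col], format='%Y%m%d')
df['90DayPostDischarge'] = df['discharge'] + pd.Timedelta(days=90)

# Condition 1: Num1 between 10 and 15 (inclusive on both ends).
eligible = df['Num1'].between(10, 15)
elig = df[eligible]

# Previous eligible row *within the same Key* (assumption, see lead-in).
# The first row of each group gets NaT, which compares as False below.
prev = elig.groupby('Key')[['discharge', '90DayPostDischarge']].shift(1)

# Condition 2: admit falls inside the previous row's discharge window.
in_window = (elig['admit'] > prev['discharge']) & \
            (elig['admit'] <= prev['90DayPostDischarge'])

# Rows that failed condition 1 are absent from in_window; fill them with False.
df['Readmit'] = in_window.reindex(df.index, fill_value=False).astype(int)
print(df[['Key', 'Num1', 'admit', 'Readmit']])
```

Everything here is vectorized (no Python-level loop over rows), which should matter at the row counts mentioned in the question. The frame still has to be sorted by Key and admit beforehand, as described above.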

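My attempt: I first drop all rows that do not meet the first condition (I realize that making a new df may not be the best approach; I am struggling with this part). Next I try to flag the rows that may meet the second condition, but my code only returns the single value 'true' instead of a df with a flag column. I am drawing a blank on this approach. Any help would be greatly appreciated.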

df2 = df[df['Num1'].isin([10,11,12,13,14,15])]
df2 = df.loc[((df['admit'] > df['discharge'].shift(1)) & \
                  (df['admit'] <= df['90DayPostDischarge'].shift(1))),'readmit'] = 'true'
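A likely reason the attempt yields only 'true': in Python, `a = b[...] = value` is a chained assignment, so the same value is bound to every target, including `df2`. The sketch below reproduces the effect on a hypothetical three-row frame (`toy`, `fixed`, `result` and `flag` are illustrative names, not from the question). Two other things look off in the attempt as written: the second statement reads from `df` rather than the filtered `df2`, and `shift(1)` without a groupby compares across patient boundaries.

```python
import pandas as pd

toy = pd.DataFrame({'x': [1, 2, 3]})

# Chained assignment: 'true' is bound to `result`, and the same string is
# also written into the selected cells of toy. `result` is never a DataFrame.
result = toy.loc[toy['x'] > 1, 'flag'] = 'true'
print(type(result), result)

# Keeping the steps separate avoids the surprise: build the mask first,
# then assign it as a column on the frame.
fixed = pd.DataFrame({'x': [1, 2, 3]})
mask = fixed['x'] > 1
fixed['flag'] = mask
print(fixed)
```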

0 answers:

No answers