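I'm analyzing clinical data and trying to filter rows out of one dataframe based on information in another dataframe.
One of the dataframes lists the dates on which patients received treatments: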
import pandas as pd
dfTreatments = pd.DataFrame({'PatientID': [4,4,4,9,9,9,11,11,11], 'TreatmentDate': ['2016-01-01', '2016-01-15', '2016-03-25','2016-01-01','2016-01-15','2016-01-29','2016-01-01','2016-03-15','2016-03-25']})
dfTreatments['TreatmentDate'] = pd.to_datetime(dfTreatments['TreatmentDate'])
   PatientID TreatmentDate
0          4    2016-01-01
1          4    2016-01-15
2          4    2016-03-25
3          9    2016-01-01
4          9    2016-01-15
5          9    2016-01-29
6         11    2016-01-01
7         11    2016-03-15
8         11    2016-03-25
The other dataframe lists the dates on which patients visited the hospital with complications:
dfHospitalVisits = pd.DataFrame({'PatientID': [4,4,9,11], 'HospitalVisitDate': ['2016-01-14','2016-03-10','2016-01-28','2016-01-03']})
dfHospitalVisits['HospitalVisitDate'] = pd.to_datetime(dfHospitalVisits['HospitalVisitDate'])
  HospitalVisitDate  PatientID
0        2016-01-14          4
1        2016-03-10          4
2        2016-01-28          9
3        2016-01-03         11
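In our study, we want to exclude hospital visits from the analysis if the patient has gone 20 days without treatment. We start excluding visits from the last treatment before the 20-day gap. For example, we will exclude any hospital visit for patient 4 that occurs after 2016-01-15.
In this example, patient 4's second hospital visit and patient 11's hospital visit would be removed from dfHospitalVisits.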
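Edit: @Merlin, so far I have used dfTreatments.groupby('PatientID')['TreatmentDate'].diff() to get the gaps between treatment dates within each patient. The part I'm stuck on is how to use the treatment dates whose gap is 20 days or more to filter the rows of dfHospitalVisits.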
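To make the diff() step concrete, here is a minimal sketch on the sample data. It assumes the treatments are sorted by date within each patient; the names gap_since_prev, gap_until_next and long_gaps are illustrative and not part of the original post. Calling diff() with periods=-1 and flipping the sign attaches each gap to the treatment that precedes it, which is the row the exclusion rule refers to.

```python
import pandas as pd

dfTreatments = pd.DataFrame({
    'PatientID': [4, 4, 4, 9, 9, 9, 11, 11, 11],
    'TreatmentDate': pd.to_datetime([
        '2016-01-01', '2016-01-15', '2016-03-25',
        '2016-01-01', '2016-01-15', '2016-01-29',
        '2016-01-01', '2016-03-15', '2016-03-25']),
})

# Sort so that consecutive rows within a patient are consecutive treatments
treatments = dfTreatments.sort_values(['PatientID', 'TreatmentDate'])
by_patient = treatments.groupby('PatientID')['TreatmentDate']

# Gap since the previous treatment (NaT on each patient's first treatment)
gap_since_prev = by_patient.diff()

# Gap until the next treatment (NaT on each patient's last treatment);
# diff(-1) is "this row minus the next row", hence the sign flip
gap_until_next = -by_patient.diff(-1)

# Treatments that are followed by a gap of 20 days or more
long_gaps = treatments[gap_until_next >= pd.Timedelta(days=20)]
print(long_gaps)
```

On the sample data this selects the 2016-01-15 treatment of patient 4 and the 2016-01-01 treatment of patient 11, which are the dates after which visits should be excluded.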
Answer 0 (score: 0)
I suggest the following:
# Sort in descending order so that diff() compares each treatment with the next (later) one
dfTreatments_sorted = dfTreatments.sort_values(['PatientID', 'TreatmentDate'], ascending=False)
# Time gap before the next treatment (negative here because of the descending sort)
df_diff = dfTreatments_sorted.groupby('PatientID')['TreatmentDate'].diff(periods=1).rename('Gap_before_next_treatment')
# Add the sign-flipped gaps as a new column to the existing dfTreatments dataframe
dfTreatments = pd.concat([dfTreatments, -df_diff], axis=1, join='inner').sort_index()
# Join dfTreatments and dfHospitalVisits into a new dataframe (df), one row per visit/treatment pair
df = dfHospitalVisits.set_index('PatientID').join(dfTreatments.set_index('PatientID'))
# Keep only the pairs where the TreatmentDate is before the corresponding HospitalVisitDate
df = df[df.HospitalVisitDate > df.TreatmentDate]
# Per visit, max() is taken column by column: the latest earlier treatment
# and the largest gap among the treatments before the visit
df = df.reset_index().groupby(['PatientID', 'HospitalVisitDate']).max()
# Keep the hospital visits whose gap is under 20 days (an explicit Timedelta instead of the string '20 days')
df = df[df.Gap_before_next_treatment < pd.Timedelta(days=20)].reset_index()[['PatientID', 'HospitalVisitDate']]
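An alternative that avoids the many-to-many join is to compute one cutoff date per patient and filter against it. The following is a sketch under stated assumptions, not the answer's method: the names MAX_GAP, cutoffs, CutoffDate and dfFiltered are made up for illustration, patients with no long gap keep all their visits, and the time between a patient's last treatment and a later visit is not treated as a gap.

```python
import pandas as pd

dfTreatments = pd.DataFrame({
    'PatientID': [4, 4, 4, 9, 9, 9, 11, 11, 11],
    'TreatmentDate': pd.to_datetime([
        '2016-01-01', '2016-01-15', '2016-03-25',
        '2016-01-01', '2016-01-15', '2016-01-29',
        '2016-01-01', '2016-03-15', '2016-03-25']),
})
dfHospitalVisits = pd.DataFrame({
    'PatientID': [4, 4, 9, 11],
    'HospitalVisitDate': pd.to_datetime(
        ['2016-01-14', '2016-03-10', '2016-01-28', '2016-01-03']),
})

MAX_GAP = pd.Timedelta(days=20)  # threshold taken from the question

# Gap between each treatment and the next one for the same patient
treatments = dfTreatments.sort_values(['PatientID', 'TreatmentDate'])
gap_until_next = -treatments.groupby('PatientID')['TreatmentDate'].diff(-1)

# Cutoff = earliest treatment per patient that is followed by a gap of 20+ days
cutoffs = (treatments[gap_until_next >= MAX_GAP]
           .groupby('PatientID')['TreatmentDate'].min()
           .rename('CutoffDate')
           .reset_index())

# Left merge: patients without a long gap get CutoffDate = NaT
visits = dfHospitalVisits.merge(cutoffs, on='PatientID', how='left')

# Keep visits of patients with no cutoff, or visits on or before the cutoff
keep = visits['CutoffDate'].isna() | (visits['HospitalVisitDate'] <= visits['CutoffDate'])
dfFiltered = visits.loc[keep, ['PatientID', 'HospitalVisitDate']].reset_index(drop=True)
print(dfFiltered)
```

On the sample data this keeps patient 4's visit on 2016-01-14 and patient 9's visit on 2016-01-28. Unlike the join-based code above, it also keeps visits that occur before a patient's first treatment, so which variant fits depends on how such visits should be handled in the study.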