Pandas:根据来自多个数据帧的信息过滤数据

时间:2016-08-08 17:18:31

标签: pandas

我正在分析临床数据,并试图根据另一个数据框中的信息过滤掉一个数据框中的信息。

其中一个数据框列出了患者接受治疗的日期

dfTreatments = pd.DataFrame({'PatientID': [4,4,4,9,9,9,11,11,11], 'TreatmentDate': ['2016-01-01', '2016-01-15', '2016-03-25','2016-01-01','2016-01-15','2016-01-29','2016-01-01','2016-03-15','2016-03-25']})
dfTreatments['TreatmentDate'] = pd.to_datetime(dfTreatments['TreatmentDate'])

   PatientID TreatmentDate
0          4    2016-01-01
1          4    2016-01-15
2          4    2016-03-25
3          9    2016-01-01
4          9    2016-01-15
5          9    2016-01-29
6         11    2016-01-01
7         11    2016-03-15
8         11    2016-03-25

和其他数据框列出了患者就诊并发病的日期。

dfHospitalVisits = pd.DataFrame({'PatientID': [4,4,9,11], 'HospitalVisitDate': ['2016-01-14','2016-03-10','2016-01-28','2016-01-03']})
dfHospitalVisits['HospitalVisitDate'] = pd.to_datetime(dfHospitalVisits['HospitalVisitDate'])

  HospitalVisitDate  PatientID
0        2016-01-14          4
1        2016-03-10          4
2        2016-01-28          9
3        2016-01-03         11

在我们的研究中,如果患者未接受20天治疗,我们希望从我们的分析中排除医院就诊。我们在20天差距之前的最后一次治疗中开始排除它们。例如:我们将在2016-01-15之后排除患者4的任何住院就诊。

在此示例中,患者4的第二次医院就诊患者11的医院就诊将从dfHospitalVisits中删除。

编辑:@Merlin,到目前为止,我已经使用dfTreatments.groupby('PatientID')['TreatmentDate'].diff()来获取患者分组的治疗日期的差距。我被困的部分是我不知道如何使用大于等于20的治疗日期来过滤dfHospitalVisits中的值。

1 个答案:

答案 0 :(得分:0)

我建议如下:

# Make a sorted dataframe to calculate the time gap before the next treatment 
dfTreatments_sorted = dfTreatments.sort_values(['PatientID','TreatmentDate'], ascending=False)

# Calculate the time gap before the next treatment
df_diff = dfTreatments_sorted.groupby('PatientID').TreatmentDate.diff(periods=1).rename('Gap_before_next_treatment')

# Add the time gaps as a new column to your existing dfTreatments dataframe
dfTreatments = pd.concat([dfTreatments, -df_diff], axis=1, join='inner').sort_index()

# Join dfTreatments and dfHospitalVisits into new dataframe (df)
df = dfHospitalVisits.set_index('PatientID').join(dfTreatments.set_index('PatientID'))

# Select combination where TreatmentDate is before corresponding HospitalVisitDate
df = df[(df.HospitalVisitDate>df.TreatmentDate)]

# The TreatmentDate that is important is latest before the HospitalVisitDate
df = df.reset_index().groupby(['PatientID','HospitalVisitDate']).max()

# Now you can filter hospital visits given the calculated time gap
df = df[df.Gap_before_next_treatment<'20 days'].reset_index()[['PatientID','HospitalVisitDate']]