I have two data frames:
df = pd.DataFrame({'ID': ['1','1','1','2','2','3','4','4'], \
'ward': ['icu', 'surgery','icu', 'neurology','neurology','obstetrics','OPD', 'surgery'], \
'start_date': ['2016-10-22 18:19:19', '2016-10-24 10:20:00','2016-10-24 12:41:30', '2016-11-09 19:41:30','2016-11-09 23:20:00','2016-11-08 09:45:00','2016-10-15 09:15:00','2016-10-15 12:15:01'], \
'end_date': ['2016-10-24 10:10:19', '2016-10-24 12:40:30','2016-10-26 11:15:00', '2016-11-09 22:11:00','2016-11-11 13:30:00','2016-11-09 07:25:00','2016-10-15 12:15:00','2016-10-17 17:25:00'] })
df1 = pd.DataFrame({'ID': ['1','2','4'], \
'ward': ['radiology', 'rehabilitation','radiology'], \
'date': ['2016-10-23 10:50:00', '2016-11-24 10:20:00','2016-10-15 18:41:30']})
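I want to compare the data in df1 against df by matching on ID and checking whether date in df1 falls between start_date and end_date in df. If both conditions match, I want to add another row to df for that particular ID (with the data coming from df1). Wherever a new row is added, I also want to change the date/time of the previous and next rows.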
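To make the two conditions concrete, here is a minimal sketch on a cut-down copy of the data (IDs 1 and 2 only). The names pairs and matches are purely illustrative, and this only finds the matching rows; it does not yet split them.

```python
import pandas as pd

# Cut-down stand-ins for df and df1 (IDs 1 and 2 only), already parsed as datetimes.
df = pd.DataFrame({
    'ID': ['1', '1', '2'],
    'ward': ['icu', 'surgery', 'neurology'],
    'start_date': pd.to_datetime(['2016-10-22 18:19:19', '2016-10-24 10:20:00',
                                  '2016-11-09 19:41:30']),
    'end_date': pd.to_datetime(['2016-10-24 10:10:19', '2016-10-24 12:40:30',
                                '2016-11-09 22:11:00']),
})
df1 = pd.DataFrame({
    'ID': ['1', '2'],
    'ward': ['radiology', 'rehabilitation'],
    'date': pd.to_datetime(['2016-10-23 10:50:00', '2016-11-24 10:20:00']),
})

# Condition 1: same ID. Pair every df row with every df1 row sharing its ID,
# keeping the original df row label in the 'index' column.
pairs = df.reset_index().merge(df1, on='ID', suffixes=('', '_new'))

# Condition 2: date lies inside [start_date, end_date] (both ends inclusive).
matches = pairs[pairs['date'].between(pairs['start_date'], pairs['end_date'])]

print(matches[['index', 'ID', 'ward', 'ward_new', 'date']])
```

Here only the first icu row of ID 1 matches; the df1 row for ID 2 falls outside every interval of ID 2 and is ignored.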
This is the final result I am after:
    ID        ward           start_date             end_date
0    1         icu  2016-10-22 18:19:19  2016-10-23 10:50:00
1    1   radiology  2016-10-23 10:50:00  2016-10-23 10:50:00
2    1         icu  2016-10-23 10:50:00  2016-10-24 10:10:19
3    1     surgery  2016-10-24 10:20:00  2016-10-24 12:40:30
4    1         icu  2016-10-24 12:41:30  2016-10-26 11:15:00
5    2   neurology  2016-11-09 19:41:30  2016-11-09 22:11:00
6    2   neurology  2016-11-09 23:20:00  2016-11-11 13:30:00
7    3  obstetrics  2016-11-08 09:45:00  2016-11-09 07:25:00
8    4         OPD  2016-10-15 09:15:00  2016-10-15 12:15:00
9    4  hematology  2016-10-15 12:15:00  2016-10-15 18:41:30
10   4   radiology  2016-10-15 18:41:30  2016-10-15 18:41:30
11   4  hematology  2016-10-15 18:41:30  2016-10-17 17:25:00
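In this example, ID 1 and ID 4 satisfy the conditions across the two data frames. Taking ID 1 as the illustration: originally ID 1 goes icu -> surgery -> icu, but after comparing and inserting the new rows, the final data shows ID 1 going icu -> radiology -> icu -> surgery -> icu. ID 1 now consists of 5 rows instead of 3, and the start_date and end_date of the affected rows are updated as well.
The dataset (df) is large, about 1 million rows, and I don't know which approach would get the correct result efficiently. Any help would be much appreciated.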
Answer 0 (score: 1)
Following the guidance explained here, I came up with the following approach:
import pandas as pd
df = pd.DataFrame({'ID': ['1','1','1','2','2','3','4','4'], \
'ward': ['icu', 'surgery','icu', 'neurology','neurology','obstetrics','OPD', 'surgery'], \
'start_date': ['2016-10-22 18:19:19', '2016-10-24 10:20:00','2016-10-24 12:41:30', '2016-11-09 19:41:30','2016-11-09 23:20:00','2016-11-08 09:45:00','2016-10-15 09:15:00','2016-10-15 12:15:01'], \
'end_date': ['2016-10-24 10:10:19', '2016-10-24 12:40:30','2016-10-26 11:15:00', '2016-11-09 22:11:00','2016-11-11 13:30:00','2016-11-09 07:25:00','2016-10-15 12:15:00','2016-10-17 17:25:00'] })
df1 = pd.DataFrame({'ID': ['1','2','4'], \
'ward': ['radiology', 'rehabilitation','radiology'], \
'date': ['2016-10-23 10:50:00', '2016-11-24 10:20:00','2016-10-15 18:41:30']})
# Converting str datetime to datetime objects
df.start_date = pd.to_datetime(df.start_date)
df.end_date = pd.to_datetime(df.end_date)
df1.date = pd.to_datetime(df1.date)
# Change the index to intervals
df_temp = df.copy()
df_temp.index = pd.IntervalIndex.from_arrays(df_temp['start_date'],df_temp['end_date'],closed='both')
# Find the interval to split
def find_interval(row):
    try:
        return df_temp.loc[row.date].loc[(df_temp.ID == row.ID)].iloc[0]
    except KeyError:
        # This value does not fall within any interval in df
        return
# These are all the rows to be altered:
to_remove = df1.apply(find_interval, axis=1).dropna()
"""
to_remove
   ID     ward           start_date             end_date
0   1      icu  2016-10-22 18:19:19  2016-10-24 10:10:19
2   4  surgery  2016-10-15 12:15:01  2016-10-17 17:25:00 """
# Create 3 new rows for every matching
def new_rows(row):
    try:
        # Create the new rows by taking information from the existing row
        existing = df_temp.loc[row.date].loc[(df_temp.ID == row.ID)].iloc[0]
        out = pd.DataFrame(dict(
            ID=[row.ID] * 3,
            ward=[existing.ward, row.ward, existing.ward],
            start_date=[existing.start_date, row.date, row.date],
            end_date=[row.date, row.date, existing.end_date]
        ))
        return out
    except KeyError:
        # No interval in df contains this date, so there is nothing to split
        return
to_add = pd.concat(df1.apply(new_rows, axis=1).values)
"""
to_add
   ID       ward           start_date             end_date
0   1        icu  2016-10-22 18:19:19  2016-10-23 10:50:00
1   1  radiology  2016-10-23 10:50:00  2016-10-23 10:50:00
2   1        icu  2016-10-23 10:50:00  2016-10-24 10:10:19
0   4    surgery  2016-10-15 12:15:01  2016-10-15 18:41:30
1   4  radiology  2016-10-15 18:41:30  2016-10-15 18:41:30
2   4    surgery  2016-10-15 18:41:30  2016-10-17 17:25:00 """
# Remove the 'to_remove'
new = pd.concat([df,to_remove]).drop_duplicates(keep=False)
# Add the 'to_add'
new = pd.concat([new, to_add])
# Sort the finished dataframe
new = new.sort_values(['ID', 'start_date']).reset_index(drop=True)
"""
new
    ID        ward           start_date             end_date
0    1         icu  2016-10-22 18:19:19  2016-10-23 10:50:00
1    1   radiology  2016-10-23 10:50:00  2016-10-23 10:50:00
2    1         icu  2016-10-23 10:50:00  2016-10-24 10:10:19
3    1     surgery  2016-10-24 10:20:00  2016-10-24 12:40:30
4    1         icu  2016-10-24 12:41:30  2016-10-26 11:15:00
5    2   neurology  2016-11-09 19:41:30  2016-11-09 22:11:00
6    2   neurology  2016-11-09 23:20:00  2016-11-11 13:30:00
7    3  obstetrics  2016-11-08 09:45:00  2016-11-09 07:25:00
8    4         OPD  2016-10-15 09:15:00  2016-10-15 12:15:00
9    4     surgery  2016-10-15 12:15:01  2016-10-15 18:41:30
10   4   radiology  2016-10-15 18:41:30  2016-10-15 18:41:30
11   4     surgery  2016-10-15 18:41:30  2016-10-17 17:25:00 """
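Since the question mentions a million rows, the two row-wise apply calls may become the bottleneck. One possible vectorized variant is sketched below. Treat it as a sketch under assumptions rather than a drop-in replacement: insert_visits and its intermediate names are hypothetical, it assumes at most one df1 row falls inside any single interval of df, and it assumes df has a default index and no column called index.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['1', '1', '1', '2', '2', '3', '4', '4'],
    'ward': ['icu', 'surgery', 'icu', 'neurology', 'neurology',
             'obstetrics', 'OPD', 'surgery'],
    'start_date': pd.to_datetime([
        '2016-10-22 18:19:19', '2016-10-24 10:20:00', '2016-10-24 12:41:30',
        '2016-11-09 19:41:30', '2016-11-09 23:20:00', '2016-11-08 09:45:00',
        '2016-10-15 09:15:00', '2016-10-15 12:15:01']),
    'end_date': pd.to_datetime([
        '2016-10-24 10:10:19', '2016-10-24 12:40:30', '2016-10-26 11:15:00',
        '2016-11-09 22:11:00', '2016-11-11 13:30:00', '2016-11-09 07:25:00',
        '2016-10-15 12:15:00', '2016-10-17 17:25:00']),
})
df1 = pd.DataFrame({
    'ID': ['1', '2', '4'],
    'ward': ['radiology', 'rehabilitation', 'radiology'],
    'date': pd.to_datetime(['2016-10-23 10:50:00', '2016-11-24 10:20:00',
                            '2016-10-15 18:41:30']),
})


def insert_visits(stays, visits):
    """Split every row of `stays` whose interval contains a `visits` date with the same ID.

    Assumption: at most one visit falls inside any single stay.
    """
    # Pair rows by ID, remembering each stay's original row label in 'index'.
    pairs = stays.reset_index().merge(visits, on='ID', suffixes=('', '_visit'))
    hit = pairs[pairs['date'].between(pairs['start_date'], pairs['end_date'])]

    # Three pieces per hit: the part before the visit, the visit, the part after.
    before = hit.assign(end_date=hit['date'])
    during = hit.assign(ward=hit['ward_visit'],
                        start_date=hit['date'], end_date=hit['date'])
    after = hit.assign(start_date=hit['date'])

    cols = ['ID', 'ward', 'start_date', 'end_date']
    untouched = stays.drop(index=hit['index'])  # stays that were not split
    out = pd.concat([untouched[cols], before[cols], during[cols], after[cols]])
    return out.sort_values(['ID', 'start_date', 'end_date']).reset_index(drop=True)


result = insert_visits(df, df1)
print(result)
```

On the sample data this produces the same 12 rows as new above. The merge creates one row per (df row, df1 row) pair that shares an ID before filtering, so memory use grows with that count; this has not been benchmarked on a frame of the size described in the question.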