我有两个数据框,一个保存独特的情节数据记录,另一个保存事件列表。每集有多个事件。我需要遍历剧集数据,找到与每个剧集对应的所有事件,并为新数据帧写入结果事件。大约有 4,000 集和 20,000 个事件。这个过程非常缓慢,因为每一集我都在搜索 20,000 个事件。我猜有一种方法可以通过删除匹配的事件来减少每个循环搜索的事件数量 - 但我不确定。这是我的代码(有额外的过滤来帮助匹配)
for idx, row in episode_df.iterrows():
total_episodes += 1
icu_admission = datetime.strptime(row['ICU_ADM'], '%d/%m/%Y %H:%M:%S')
tmp_df = event_df.loc[event_df['ur'] == row['HRN']]
if ( len(tmp_df.index) < 1):
empty_episodes += 1
continue
# Loop through temp dataframe and write all records with an admission date
# close to icu_admission to new dataframe
for idx_a, row_a in tmp_df.iterrows():
admission = datetime.strptime(row_a['admission'], '%Y-%m-%d %H:%M:%S')
difference = admission - icu_admission
if (abs(difference.total_seconds()) > 14400):
continue
new_df = new_df.append(row_a)
selected_records += 1
数据框的简化版本:
episode_df:
episode_no HRN name ICU_ADM
1 12345 joe date1
2 78124 ann date1
3 98374 bill date2
4 76523 lucy date3
event_df
episode_no ur admission
1 12345 date1
1 12345 date1
1 12345 date5
7 67899 date9
并非所有剧集都有事件,只有有剧集的事件需要复制
答案 0 :(得分:1)
这可以工作:
import pandas as pd
import numpy as np
df1 = pd.DataFrame()
df1['ICU_ADM'] = [pd.to_datetime(f'2020-01-{x}') for x in range(1,10)]
df1['test_day'] = df1['ICU_ADM'].dt.day
df2 = pd.DataFrame()
df2['admission'] = [pd.to_datetime(f'2020-01-{x}') for x in range(2,10,3)]
df2['admission_day'] = df2['admission'].dt.day
df2['random_val'] = np.random.rand(len(df2),1)
pd.merge_asof(df1, df2, left_on=['ICU_ADM'], right_on=['admission'], tolerance=pd.Timedelta('1 day'))