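I have a data frame consisting of IDs and dates. An ID can have multiple dates; within each ID, the rows are sorted by date.
My second data frame consists of ID, start date, finish date, a boolean column Accident (indicating whether an accident occurred) and a Time (time to event) column. The last two columns are initially set to 0. Again, the rows are sorted by ID and, within each ID, by time interval.
I want to update these two columns of the second data frame based on the accidents recorded in the first one. If an ID is present in both data frames (it does not have to be), check whether any accident was recorded within any of that ID's time intervals in the second data frame.
If so, find the interval in which it happened, update the Accident column to 1 and set Time = df1.Date - df2.Start. If not, set Accident = 0 and Time = df2.Finish - df2.Start for that patient's entry.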
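To make the rule concrete, here it is for a single interval as a small sketch. The helper name `event_for_interval` and the sample dates are made up for illustration; as in my real data, the interval bounds are assumed to be strings and the accident dates datetime objects.

```python
from datetime import datetime

def event_for_interval(start, finish, accident_dates):
    """Return (accident, time) for one interval of one patient.

    Hypothetical helper: accident_dates holds that patient's recorded
    accident dates as datetime objects.
    """
    fmt = "%Y-%m-%d"
    start = datetime.strptime(start, fmt)
    finish = datetime.strptime(finish, fmt)
    # Keep only the accidents that fall inside the interval (bounds inclusive)
    hits = [d for d in accident_dates if start <= d <= finish]
    if hits:
        # Only the first accident inside the interval counts
        return 1, min(hits) - start
    # No accident: the time to event is the full length of the interval
    return 0, finish - start

accidents = [datetime(2020, 1, 10), datetime(2020, 1, 20)]
print(event_for_interval("2020-01-01", "2020-01-31", accidents))
print(event_for_interval("2020-02-01", "2020-02-28", accidents))
```

The first call has two accidents inside the interval and keeps the earlier one (9 days after the start); the second has none, so it falls back to the interval length (27 days).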
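I managed to do this with lists and for loops. However, I was wondering whether there is a smarter way, since the data set is huge and the whole process takes a lot of time. Thanks in advance!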
```python
from datetime import datetime

import pandas as pd

# Temporary lists
df1list = []
df2list = []

# Change format from dataframe to list
for row in df1.itertuples(index=True, name='Pandas'):
    # Get Patient ID and the date of the recorded accident
    df1list.append([getattr(row, "Patient"), getattr(row, "regdatum")])

# Change format from dataframe to list
for row in df2.itertuples(index=True, name='Pandas'):
    # Get Patient ID, info, occurrence of accident and time to event
    df2list.append([getattr(row, "Patient"), getattr(row, "Start"), getattr(row, "Finish"), getattr(row, "Gender"),
                    getattr(row, "Age"), getattr(row, "Accident"), getattr(row, "Time")])

# For each interval of each patient
for i in range(0, len(df2list)):
    # For each recorded accident of each patient
    for j in range(0, len(df1list)):
        # If there's a match in both lists
        if df2list[i][0] == df1list[j][0]:
            # If the recorded date is in between the time interval
            if (df1list[j][1] >= datetime.strptime(df2list[i][1], '%Y-%m-%d')) and (df1list[j][1] <= datetime.strptime(df2list[i][2], '%Y-%m-%d')):
                # Change the accident column to 1 and calculate the time to event
                # The extra if verifies that this recorded accident is the first one within the
                # time interval (if there are multiple, we only keep the first one)
                # Accident is at index 5 and Time at index 6 of each row list
                if df2list[i][5] == 0:
                    df2list[i][5] = 1
                    df2list[i][6] = df1list[j][1] - datetime.strptime(df2list[i][1], '%Y-%m-%d')

# Back to dfs
labels = ['Patient', 'Start', 'Finish', 'Gender', 'Age', 'Accident', 'Time']
df = pd.DataFrame.from_records(df2list, columns=labels)
```
Answer 0 (score: 0)
This is how I would do it.
```python
import numpy as np
import pandas as pd

# Define a pair of functions that return the unique start and finish dates for a given patient
# (each returns an empty array if the patient has no rows in df2)
def start_dates(patient):
    return df2.loc[df2['Patient'] == patient, 'Start'].unique()

def finish_dates(patient):
    return df2.loc[df2['Patient'] == patient, 'Finish'].unique()

# Add and fill 'Start' and 'Finish' columns to df1
# Note: max()/min() raise ValueError if no interval of that patient qualifies
df1['Start'] = list(zip(df1['Patient'], df1['Accident Date']))
df1['Start'] = df1['Start'].apply(lambda x: max([d for d in start_dates(x[0]) if d <= np.datetime64(x[1])]))
df1['Finish'] = list(zip(df1['Patient'], df1['Accident Date']))
df1['Finish'] = df1['Finish'].apply(lambda x: min([d for d in finish_dates(x[0]) if d >= np.datetime64(x[1])]))

# Merge the two DataFrames
df2 = df2.merge(df1, how='outer')

# Fill the 'Accident' column: 1 where an accident date was matched, 0 otherwise
df2['Accident'] = df2['Accident Date'].notna().astype(int)

# Fill NaT fields in 'Accident Date' with 'Finish'
df2['Accident Date'] = df2['Accident Date'].fillna(df2['Finish'])

# Fill 'Time' appropriately
df2['Time'] = df2['Accident Date'] - df2['Start']

# Drop the 'Accident Date' column
df2 = df2.drop(columns=['Accident Date'])
```
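This works on some dummy data I created, and I think it should work on yours as well. I doubt it is the most efficient way to do things (I am far from a pandas expert), but I think it is generally better than using loops.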
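If the per-row `apply` still turns out to be too slow, a merge-based variant might be worth trying. The following is only a rough sketch on made-up data: it assumes the accident date column is called `Date` (as in the question text), that `Start`, `Finish` and `Date` are already datetime columns, and that `df2` has a unique index.

```python
import pandas as pd

# Made-up sample data; column names and values are assumptions for illustration
df1 = pd.DataFrame({
    "Patient": [1, 1, 3],
    "Date": pd.to_datetime(["2020-01-10", "2020-01-20", "2020-05-05"]),
})
df2 = pd.DataFrame({
    "Patient": [1, 1, 2],
    "Start": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-01-01"]),
    "Finish": pd.to_datetime(["2020-01-31", "2020-02-28", "2020-03-01"]),
})

# Pair every interval with every accident of the same patient;
# reset_index() keeps the original df2 row label in a column named "index"
pairs = df2.reset_index().merge(df1, on="Patient", how="inner")

# Keep only the accidents that fall inside their interval (bounds inclusive)
inside = pairs[pairs["Date"].between(pairs["Start"], pairs["Finish"])]

# First accident per interval, keyed by the original df2 row label
first = inside.groupby("index")["Date"].min()

# 1 where the interval contains at least one accident, 0 otherwise
df2["Accident"] = df2.index.isin(first.index).astype(int)

# Event date where there was an accident, otherwise the end of the interval
end = first.reindex(df2.index).fillna(df2["Finish"])
df2["Time"] = end - df2["Start"]

print(df2)
```

On this sample data the result is Accident = 1, 0, 0 and Time = 9, 27 and 60 days. Note that the intermediate `pairs` frame holds one row per (interval, accident) combination for each patient, so it can get large if patients have many intervals and many accidents.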