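I have some patient record data, and I want to compute a metric (a qSOFA score) from these values. The metric is computed from three thresholds; each threshold that is met adds one point to that patient's overall score at that time.
The problem is that each column is sampled unevenly, so I want to apply a rolling window over time for each patient and then compute the metric.
The code breaks down as follows: select a patient, apply the rolling window and compute the metric, then add the score back to the original DataFrame.
Here is my current code, but it is very slow on large datasets.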
df['qSOFA_score'] = 0
for subject_id in subject_ids:
    # Rolling window over this patient's rows only
    patient_mask = df['subject_id'] == subject_id
    window = df[patient_mask].rolling(window_size, min_periods=1)
    # Each term is 1 if any reading in the window crosses its threshold, else 0
    a = window[BP_label].apply(lambda x: (x <= 100).any(), raw=False).fillna(0)
    b = window[RR_label].apply(lambda x: (x >= 22).any(), raw=False).fillna(0)
    c = window[GCS_label].apply(lambda x: (x < 15.).any(), raw=False).fillna(0)
    df.loc[patient_mask, 'qSOFA_score'] = a + b + c
Here is some sample data:
                     BP_systolic_(mmHg)  Respiratory_rate_(bpm)  Mental_status_(points)  subject_id
charttime
2100-06-08 00:18:00                 NaN                    16.0                     NaN     82574.0
2100-06-08 00:19:00                 NaN                     NaN                     NaN     82574.0
2100-06-08 00:22:00               101.0                     NaN                     NaN     82574.0
2100-06-08 01:00:00                99.0                    12.0                     NaN     82574.0
2100-06-08 01:08:00                 NaN                     NaN                     0.0     82574.0
2100-06-08 02:00:00               107.0                    14.0                     NaN     82574.0
2100-06-08 02:30:00                 NaN                    11.0                     NaN     82574.0
2100-06-08 02:44:00                 NaN                    16.0                     NaN     82574.0
2100-06-08 02:45:00               101.0                    14.0                     NaN     82574.0
2100-06-08 03:00:00                97.0                    13.0                     NaN     82574.0
2100-06-08 03:15:00                 NaN                    16.0                     NaN     82574.0
2100-06-08 03:30:00                 NaN                    13.0                     NaN     82574.0
2100-06-08 03:37:00                94.0                     NaN                     NaN     82574.0
2100-06-08 04:00:00                95.0                    11.0                     0.0     82574.0
2100-06-08 04:15:00                 NaN                    15.0                     NaN     82574.0
2100-06-08 04:20:00                 NaN                    19.0                     NaN     82574.0
2100-06-08 04:21:00               104.0                     NaN                     NaN     82574.0
2100-06-08 05:00:00                98.0                     9.0                     NaN     82574.0
2100-06-08 07:00:00               107.0                    11.0                     NaN     82574.0
2100-06-08 08:00:00               101.0                    14.0                     NaN     82574.0
2100-06-08 09:00:00               109.0                    15.0                     NaN     82574.0
2100-06-08 10:00:00               112.0                    14.0                     NaN     82574.0
2100-06-08 10:30:00                 NaN                    11.0                     NaN     82574.0
2100-06-08 10:33:00                 NaN                     NaN                     0.0     82574.0
2100-06-08 11:00:00               102.0                    11.0                     NaN     82574.0
2100-06-08 12:00:00               103.0                    10.0                     NaN     82574.0
2100-06-08 13:00:00               112.0                    10.0                     NaN     82574.0
2100-06-08 14:00:00                 NaN                    12.0                     NaN     82574.0
2100-07-03 00:41:00               124.0                     NaN                     NaN     31585.0
2100-07-03 00:44:00                 NaN                    17.0                     NaN     31585.0
...                                 ...                     ...                     ...         ...
2209-08-06 22:15:00               109.0                     NaN                     NaN     25723.0
2209-08-06 23:00:00               100.0                    22.0                     NaN     25723.0
2209-08-07 00:00:00                 NaN                    32.0                     NaN     25723.0
2209-08-07 00:30:00               118.0                     NaN                     NaN     25723.0
2209-08-07 01:00:00                 NaN                    18.0                     NaN     25723.0
2209-08-07 01:15:00               103.0                     NaN                     NaN     25723.0
2209-08-07 02:00:00                97.0                    19.0                     NaN     25723.0
2209-08-07 03:00:00                 NaN                    24.0                     NaN     25723.0
2209-08-07 03:30:00                98.0                     NaN                     NaN     25723.0
2209-08-07 04:00:00                 NaN                    19.0                     0.0     25723.0
2209-08-07 04:15:00                90.0                     NaN                     NaN     25723.0
2209-08-07 05:00:00               118.0                    17.0                     NaN     25723.0
2209-08-07 06:00:00                 NaN                    19.0                     NaN     25723.0
2209-08-07 06:30:00               100.0                     NaN                     NaN     25723.0
2209-08-07 07:00:00                 NaN                    16.0                     NaN     25723.0
2209-08-07 07:15:00                94.0                     NaN                     NaN     25723.0
2209-08-07 08:00:00                95.0                    17.0                     0.0     25723.0
2209-08-07 09:00:00                 NaN                    17.0                     NaN     25723.0
2209-08-07 09:30:00                92.0                     NaN                     NaN     25723.0
2209-08-07 10:00:00                 NaN                    21.0                     NaN     25723.0
2209-08-07 10:15:00               101.0                     NaN                     NaN     25723.0
2209-08-07 11:00:00               113.0                    18.0                     NaN     25723.0
2209-08-07 12:00:00                 NaN                    25.0                     NaN     25723.0
2209-08-07 12:30:00                96.0                     NaN                     NaN     25723.0
2209-08-07 13:00:00                 NaN                    19.0                     NaN     25723.0
2209-08-07 13:15:00               115.0                     NaN                     NaN     25723.0
2209-08-07 14:00:00               103.0                    22.0                     NaN     25723.0
2209-08-07 15:36:00               114.0                     NaN                     NaN     25723.0
2209-08-07 15:37:00                 NaN                    19.0                     NaN     25723.0
2209-08-07 16:00:00                 NaN                    24.0                     0.0     25723.0
2265822 rows × 4 columns
Is there a more efficient way to do this?
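One possible direction, as a minimal sketch rather than a definitive implementation: evaluate the three thresholds once for the whole frame, then take a per-patient rolling maximum of the resulting 0/1 flags, since a rolling max of flags equals the per-window `.any()` test. The helper name `add_qsofa_score` and the demo frame are hypothetical. The sketch assumes a `DatetimeIndex`, no missing `subject_id` values, and a time-based window such as `"1h"` (the value of `window_size` is not given above).

```python
import numpy as np
import pandas as pd

# Column names taken from the sample data above.
BP = "BP_systolic_(mmHg)"
RR = "Respiratory_rate_(bpm)"
GCS = "Mental_status_(points)"


def add_qsofa_score(df, window="1h"):
    """Hypothetical helper: return df sorted by patient and time, plus qSOFA_score.

    Assumes a DatetimeIndex, no missing subject_id, and a time-based window.
    """
    # Two stable sorts leave the rows ordered by (subject_id, charttime).
    out = df.sort_index(kind="mergesort").sort_values("subject_id", kind="mergesort")

    # Evaluate each threshold once for the whole frame; NaN compares as False.
    flags = pd.DataFrame(
        {
            "bp": (out[BP] <= 100).to_numpy(),
            "rr": (out[RR] >= 22).to_numpy(),
            "gcs": (out[GCS] < 15).to_numpy(),
        },
        index=out.index,
    ).astype(float)

    # A rolling max over 0/1 flags is 1 when any reading in the window crossed
    # the threshold, which is what the per-window .any() in the loop computes.
    rolled = (
        flags.groupby(out["subject_id"].to_numpy())
        .rolling(window, min_periods=1)
        .max()
    )

    # Groups come back in sorted key order, matching the row order of `out`.
    out["qSOFA_score"] = rolled.sum(axis=1).to_numpy()
    return out


# Made-up demo data (not from the real dataset): two patients, uneven sampling.
idx = pd.DatetimeIndex(
    [
        "2100-06-08 00:00", "2100-06-08 00:30", "2100-06-08 01:00",
        "2100-06-08 03:00", "2100-06-09 00:00", "2100-06-09 00:10",
    ],
    name="charttime",
)
demo = pd.DataFrame(
    {
        BP: [120, 95, np.nan, 110, 90, 130],
        RR: [np.nan, 16, 24, 18, 30, 12],
        GCS: [np.nan, np.nan, 15, 14, 13, 15],
        "subject_id": [1.0, 1.0, 1.0, 1.0, 2.0, 2.0],
    },
    index=idx,
)
scored = add_qsofa_score(demo, window="1h")
print(scored["qSOFA_score"].tolist())  # [0.0, 1.0, 2.0, 1.0, 3.0, 3.0]
```

Note that the returned frame is ordered by patient and then time, not in the original row order. This avoids both the Python-level loop over patients and the per-window `lambda`, but whether it is fast enough on roughly 2.3 million rows would need to be measured.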