Pandas rolling-window metric calculation over multiple columns (speed-up)

Time: 2019-01-16 13:07:32

Tags: python pandas rolling-computation

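I have some patient record data, and I want to compute a metric from its values. The metric is computed from three thresholds; each threshold that is met adds one point to the patient's overall score at that time.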
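To make the scoring rule concrete, here is a minimal sketch of the three-threshold score at individual time points, with no windowing. The thresholds (BP <= 100, respiratory rate >= 22, mental status < 15) are taken from the code later in this question; the toy values and the name `point_score` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy readings; column names copied from the sample data.
obs = pd.DataFrame({
    "BP_systolic_(mmHg)": [101.0, 99.0, np.nan],
    "Respiratory_rate_(bpm)": [16.0, 22.0, 25.0],
    "Mental_status_(points)": [15.0, 14.0, np.nan],
})

# Each threshold that is met adds one point.
# Comparisons against NaN evaluate to False, so missing values add nothing.
point_score = (
    (obs["BP_systolic_(mmHg)"] <= 100).astype(int)
    + (obs["Respiratory_rate_(bpm)"] >= 22).astype(int)
    + (obs["Mental_status_(points)"] < 15).astype(int)
)
```

Because the three measurements are rarely recorded at the same timestamp, a row-wise score like this one misses most combinations, which is why a window over time is needed.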

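The problem is that each column is sampled irregularly, so I want to apply a rolling window over time for each patient and then compute the metric.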
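As a rough illustration of a time-based window on irregularly sampled data, the sketch below uses a few systolic BP readings from the sample data further down. The window length of `"1h"` is an assumption (the question does not state the value of `window_size`), and `window_min` and `low_bp` are illustrative names.

```python
import numpy as np
import pandas as pd

# A few BP readings for one patient, copied from the sample data.
idx = pd.to_datetime([
    "2100-06-08 00:22", "2100-06-08 01:00", "2100-06-08 01:08",
    "2100-06-08 02:00", "2100-06-08 02:30",
])
bp = pd.Series([101.0, 99.0, np.nan, 107.0, np.nan], index=idx)

# Assumed window length of one hour. With an offset window on a
# DatetimeIndex, each window covers the interval (t - 1h, t].
# min_periods=1 yields a value as soon as one non-NaN reading is in range.
window_min = bp.rolling("1h", min_periods=1).min()

# The criterion "any BP <= 100 in the window" is the same as "window minimum <= 100".
low_bp = (window_min <= 100).astype(int)
```

With an offset window, the number of rows per window varies with the sampling density, so irregular gaps between readings are handled by time rather than by row count.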

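My code breaks down as follows: select a patient, apply the rolling window and compute the metric, then add the score to the original dataframe.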

This is my current code, but it is very slow on large datasets:

# Start every row at a score of zero
df['qSOFA_score'] = 0
for subject_id in subject_ids:
    # Rolling window over this patient's rows only
    window = df[df['subject_id'] == subject_id].rolling(window_size, min_periods=1)
    # One point per criterion that is met anywhere inside the window
    a = window[BP_label].apply(lambda x: (x <= 100).any(), raw=False).fillna(0)
    b = window[RR_label].apply(lambda x: (x >= 22).any(), raw=False).fillna(0)
    c = window[GCS_label].apply(lambda x: (x < 15.).any(), raw=False).fillna(0)

    # Write the summed score back into the original dataframe
    df.loc[df['subject_id'] == subject_id, 'qSOFA_score'] = (a + b + c)

Here is some sample data:

charttime   BP_systolic_(mmHg)  Respiratory_rate_(bpm)  Mental_status_(points)  subject_id
2100-06-08 00:18:00 NaN 16.0    NaN 82574.0
2100-06-08 00:19:00 NaN NaN NaN 82574.0
2100-06-08 00:22:00 101.0   NaN NaN 82574.0
2100-06-08 01:00:00 99.0    12.0    NaN 82574.0
2100-06-08 01:08:00 NaN NaN 0.0 82574.0
2100-06-08 02:00:00 107.0   14.0    NaN 82574.0
2100-06-08 02:30:00 NaN 11.0    NaN 82574.0
2100-06-08 02:44:00 NaN 16.0    NaN 82574.0
2100-06-08 02:45:00 101.0   14.0    NaN 82574.0
2100-06-08 03:00:00 97.0    13.0    NaN 82574.0
2100-06-08 03:15:00 NaN 16.0    NaN 82574.0
2100-06-08 03:30:00 NaN 13.0    NaN 82574.0
2100-06-08 03:37:00 94.0    NaN NaN 82574.0
2100-06-08 04:00:00 95.0    11.0    0.0 82574.0
2100-06-08 04:15:00 NaN 15.0    NaN 82574.0
2100-06-08 04:20:00 NaN 19.0    NaN 82574.0
2100-06-08 04:21:00 104.0   NaN NaN 82574.0
2100-06-08 05:00:00 98.0    9.0 NaN 82574.0
2100-06-08 07:00:00 107.0   11.0    NaN 82574.0
2100-06-08 08:00:00 101.0   14.0    NaN 82574.0
2100-06-08 09:00:00 109.0   15.0    NaN 82574.0
2100-06-08 10:00:00 112.0   14.0    NaN 82574.0
2100-06-08 10:30:00 NaN 11.0    NaN 82574.0
2100-06-08 10:33:00 NaN NaN 0.0 82574.0
2100-06-08 11:00:00 102.0   11.0    NaN 82574.0
2100-06-08 12:00:00 103.0   10.0    NaN 82574.0
2100-06-08 13:00:00 112.0   10.0    NaN 82574.0
2100-06-08 14:00:00 NaN 12.0    NaN 82574.0
2100-07-03 00:41:00 124.0   NaN NaN 31585.0
2100-07-03 00:44:00 NaN 17.0    NaN 31585.0
... ... ... ... ...
2209-08-06 22:15:00 109.0   NaN NaN 25723.0
2209-08-06 23:00:00 100.0   22.0    NaN 25723.0
2209-08-07 00:00:00 NaN 32.0    NaN 25723.0
2209-08-07 00:30:00 118.0   NaN NaN 25723.0
2209-08-07 01:00:00 NaN 18.0    NaN 25723.0
2209-08-07 01:15:00 103.0   NaN NaN 25723.0
2209-08-07 02:00:00 97.0    19.0    NaN 25723.0
2209-08-07 03:00:00 NaN 24.0    NaN 25723.0
2209-08-07 03:30:00 98.0    NaN NaN 25723.0
2209-08-07 04:00:00 NaN 19.0    0.0 25723.0
2209-08-07 04:15:00 90.0    NaN NaN 25723.0
2209-08-07 05:00:00 118.0   17.0    NaN 25723.0
2209-08-07 06:00:00 NaN 19.0    NaN 25723.0
2209-08-07 06:30:00 100.0   NaN NaN 25723.0
2209-08-07 07:00:00 NaN 16.0    NaN 25723.0
2209-08-07 07:15:00 94.0    NaN NaN 25723.0
2209-08-07 08:00:00 95.0    17.0    0.0 25723.0
2209-08-07 09:00:00 NaN 17.0    NaN 25723.0
2209-08-07 09:30:00 92.0    NaN NaN 25723.0
2209-08-07 10:00:00 NaN 21.0    NaN 25723.0
2209-08-07 10:15:00 101.0   NaN NaN 25723.0
2209-08-07 11:00:00 113.0   18.0    NaN 25723.0
2209-08-07 12:00:00 NaN 25.0    NaN 25723.0
2209-08-07 12:30:00 96.0    NaN NaN 25723.0
2209-08-07 13:00:00 NaN 19.0    NaN 25723.0
2209-08-07 13:15:00 115.0   NaN NaN 25723.0
2209-08-07 14:00:00 103.0   22.0    NaN 25723.0
2209-08-07 15:36:00 114.0   NaN NaN 25723.0
2209-08-07 15:37:00 NaN 19.0    NaN 25723.0
2209-08-07 16:00:00 NaN 24.0    0.0 25723.0
2265822 rows × 4 columns

Is there a more efficient way to do this?
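One possible direction, offered as an unbenchmarked sketch rather than a verified answer: the criterion "any value in the window crosses the threshold" can be rewritten as a comparison on the window minimum or maximum. That allows the built-in rolling `min`/`max` to be used across all patients at once through `groupby(...).rolling(...)`, instead of calling a Python lambda once per window. The function name `qsofa_rolling`, the short column names and the toy data are hypothetical. The sketch assumes a DatetimeIndex, no missing `subject_id` values, and that returning a copy sorted by patient and time is acceptable.

```python
import numpy as np
import pandas as pd

def qsofa_rolling(df, window_size, bp_col, rr_col, gcs_col, id_col="subject_id"):
    """Return a copy of df, sorted by patient and time, with a qSOFA_score column."""
    # Sort by time, then stable-sort by patient, so each patient's rows are
    # contiguous and time-ordered (required for offset-based windows).
    ordered = df.sort_index().sort_values(id_col, kind="mergesort")

    # One rolling object over all patients; windows never cross patient boundaries.
    rolled = ordered.groupby(id_col)[[bp_col, rr_col, gcs_col]].rolling(
        window_size, min_periods=1
    )
    win_min = rolled.min()
    win_max = rolled.max()

    # any(x <= 100) == (min <= 100); any(x >= 22) == (max >= 22); any(x < 15) == (min < 15).
    # Windows without any valid reading give NaN, which compares as False (0 points).
    score = (
        (win_min[bp_col] <= 100).to_numpy().astype(int)
        + (win_max[rr_col] >= 22).to_numpy().astype(int)
        + (win_min[gcs_col] < 15).to_numpy().astype(int)
    )

    # The grouped result follows the order of `ordered` (groups sorted by key,
    # rows in original order within each group), so positional assignment lines up.
    return ordered.assign(qSOFA_score=score)

# Hypothetical toy data with two interleaved patients.
toy = pd.DataFrame(
    {
        "bp": [99.0, np.nan, 110.0, 120.0, np.nan],
        "rr": [np.nan, 23.0, 12.0, 18.0, 22.0],
        "gcs": [np.nan, np.nan, 15.0, 14.0, np.nan],
        "subject_id": [1, 1, 1, 2, 2],
    },
    index=pd.to_datetime([
        "2100-01-01 00:00", "2100-01-01 00:30", "2100-01-01 02:00",
        "2100-01-01 00:10", "2100-01-01 00:20",
    ]),
)
toy.index.name = "charttime"

out = qsofa_rolling(toy, "1h", "bp", "rr", "gcs")
```

Whether this is faster on the full 2,265,822-row dataset would need to be measured. The intent is only to replace the per-patient Python loop and the per-window `apply` calls with compiled rolling reductions.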

0 answers:

No answers